Random Sampling on Big Data: Techniques and Applications Ke Yi


1 Random Sampling on Big Data: Techniques and Applications. Ke Yi, Hong Kong University of Science and Technology.

2 Big Data in one slide. The 3 V's: Volume, Velocity, Variety. The data itself can be: integers, real numbers; points in a multi-dimensional space; records in a relational database; graph-structured data.

3 Dealing with Big Data. The first approach: scale up / out the computation. Many great technical innovations: distributed/parallel systems; simpler programming models (MapReduce, Pregel, Dremel, Spark, BSP); failure tolerance and recovery; dropping certain features (ACID, cf. CAP, NoSQL). This talk is not about this approach!

4 Downsizing data. A second approach to computational scalability: scale down the data! A compact representation of a large data set; there is too much redundancy in big data anyway, and what we finally want is small: human-readable analyses / decisions. Necessarily gives up some accuracy: approximate answers. Examples: samples, sketches, histograms, various transforms (see the tutorial by Graham Cormode for other data summaries). Complementary to the first approach: we can scale out the computation and scale down the data at the same time. Algorithms need to work under new system architectures; the good old RAM model no longer applies.

5 Outline for the talk. Simple random sampling: sampling from a data stream; sampling from distributed streams; sampling for range queries. Not-so-simple sampling: importance sampling (frequency estimation on distributed data); paired sampling (medians and quantiles); random walk sampling (SQL queries, joins). Will jump back and forth between theory and practice.

6 Simple Random Sampling. Sampling without replacement: randomly draw an element, don't put it back, repeat s times. Sampling with replacement: randomly draw an element, put it back, repeat s times. Both are trivial in the RAM model. The statistical difference between the two is very small when n >> s.
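As a quick illustration, here is a minimal Python sketch of the two schemes in the RAM model (the function names are just for this note):

    import random

    def sample_without_replacement(data, s):
        # draw an element, don't put it back, repeat s times
        return random.sample(data, s)

    def sample_with_replacement(data, s):
        # draw an element, put it back, repeat s times
        return [random.choice(data) for _ in range(s)]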

7 Random Sampling from a Data Stream. A stream of elements coming in at high speed; limited memory; need to maintain the sample continuously. Applications: data stored on disk, network traffic.


9 Reservoir Sampling. Maintain a sample of size s drawn (without replacement) from all elements in the stream so far. Keep the first s elements in the stream, and set n <- s. Algorithm for a new element: n <- n + 1; with probability s/n, use it to replace an item in the current sample chosen uniformly at random; with probability 1 - s/n, throw it away. Perhaps the first streaming algorithm [Waterman ??; Knuth's book].
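A minimal Python sketch of the reservoir algorithm as stated on the slide (the stream can be any iterable; names are illustrative):

    import random

    def reservoir_sample(stream, s):
        sample, n = [], 0
        for x in stream:
            n += 1
            if n <= s:
                sample.append(x)                  # keep the first s elements
            elif random.random() < s / n:         # with prob. s/n ...
                sample[random.randrange(s)] = x   # ... replace a uniform item
            # otherwise (prob. 1 - s/n) discard x
        return sample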

10 Correctness Proof. By induction on n. Base case n = s: trivially correct. Assume each element so far is sampled with probability s/n. Consider n + 1: the new element is sampled with probability s/(n+1); any element in the current sample stays sampled with probability (s/n) * (1 - s/(n+1) + (s/(n+1)) * (s-1)/s) = s/(n+1). Yeah! But this is a wrong (incomplete) proof: each element being sampled with probability s/n is not a sufficient condition for a uniform random sample. Counterexample: divide the elements into groups of s and pick one group at random.


12 Reservoir Sampling Correctness Proof. Many proofs found online are actually wrong: they only show that each item is sampled with probability s/n. We need to show that every subset of size s has the same probability of being the sample. A correct proof relates the algorithm to the Fisher-Yates shuffle. (Figure: a Fisher-Yates-style illustration with s = 2 on elements a, b, c, d.)
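Reusing the reservoir_sample sketch above, one can at least check the stronger property empirically on a tiny input: every subset of size s should show up as the sample about equally often (a sanity check, not a proof):

    from collections import Counter
    from itertools import combinations

    counts, trials = Counter(), 100000
    data = list(range(5))                         # tiny stream, s = 2
    for _ in range(trials):
        counts[tuple(sorted(reservoir_sample(data, 2)))] += 1
    for subset in combinations(data, 2):          # 10 subsets in total
        print(subset, counts[subset] / trials)    # each should be near 1/10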

13 Sampling from Distributed Streams. One coordinator and k sites; each site can communicate with the coordinator. Goal: maintain a random sample of size s over the union of all streams with minimum communication. Difficulty: we don't know n, so we can't run the reservoir sampling algorithm. Key observation: we don't have to know n in order to sample! [Cormode, Muthukrishnan, Yi, Zhang, PODS '10, JACM '12] [Woodruff, Tirthapura, DISC '11]

14 Reduction from Coin Flip Sampling. Flip a fair coin repeatedly for each element until we get a 1; an element is active on level i if its first i flips are all 0. If a level has s active elements, we can draw a sample from those active elements. Key: the coordinator does not want all the active elements, which are too many! Choose a level appropriately.

15 The Algorithm. Initialize i <- 0. In round i: sites send in every item w.p. 2^-i (this is a coin-flip sample with prob. 2^-i). The coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (so the lower sample is a sample with prob. 2^-(i+1)). When the lower sample reaches size s, the coordinator broadcasts to advance to round i <- i + 1: discard the upper sample, and split the lower sample into a new lower sample and a higher sample.
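A single-machine sketch of the coordinator's side of this protocol (a simplification for illustration; items arriving here are assumed to have already survived the site-side coin flips with probability 2^-i):

    import random

    def coordinator_receive(item, state, s):
        # state = (current round i, lower sample, higher sample)
        i, lower, higher = state
        (lower if random.random() < 0.5 else higher).append(item)
        if len(lower) == s:                  # end of round: broadcast i <- i + 1
            higher = []                      # discard the higher sample
            new_lower = []
            for x in lower:                  # split lower into new lower/higher
                (new_lower if random.random() < 0.5 else higher).append(x)
            lower, i = new_lower, i + 1
        return (i, lower, higher)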

16 Communication Cost of the Algorithm. Communication cost of each round: O(k + s) (expect to receive O(s) sampled items before the round ends; broadcasting to end the round costs O(k)). Number of rounds: O(log n): in each round, Θ(s) items need to be sampled to end the round, and each item contributes with prob. 2^-i, so Θ(2^i s) items are needed. Total communication: O((k + s) log n). Can be improved to O(k log_{k/s} n + s log n). A matching lower bound exists. Extension: sliding windows.

17 Random Sampling for Range Queries [Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD '15 Best Demo Award]

18 Online Range Sampling. Problem definition: preprocess a set of points in the plane, so that for any range query, we can return samples (with or without replacement) drawn from all points in the range, until the user terminates. Parameters: n: data size; q: query size; s: sample size (not known beforehand). Naive solutions: query then sample: O(f(n) + q); sample then query: O(sn/q) (store the data in random order). New solution: O(f(sn/q) + s), where f(x) is the number of canonical nodes in a tree of size x, between log x and x. [Wang, Christensen, Li, Yi, VLDB '16]

19 Indexing Spatial Data. Numerous spatial indexing structures exist in the literature, e.g., the R-tree.

20 RS-tree. Attach to each node u a sample drawn from the leaves below u. Total space: O(n). Construction time: O(n).

21-26 RS-tree: A 1D Example (figure walk-through). The query maintains the set of active (canonical) nodes covering the range and repeatedly reports one more sample: e.g., report 5; then pick 7 or 14 with equal prob.; then pick 3, 8, or 14 with prob. 1:1:...; then pick 3, 8, or 12 with equal prob.
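A sketch of how one sample is drawn for a 1D range query, assuming the canonical (active) nodes covering the range are already known; the real RS-tree consumes the pre-stored samples attached to these nodes instead of touching all the points below them:

    import random

    def range_sample_one(canonical_nodes):
        # canonical_nodes: list of (count of leaves below, points below)
        total = sum(count for count, _ in canonical_nodes)
        r = random.randrange(total)
        for count, points in canonical_nodes:
            if r < count:                      # node chosen w.p. proportional
                return random.choice(points)   #   to its size; then a uniform leaf
            r -= count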

27 Not-So-Simple Random Sampling When simple random sampling is not optimal/feasible

28 Frequency Estimation on Distributed Data. Given: a multiset S of n items drawn from the universe [u] (for example, IP addresses of network packets). S is partitioned arbitrarily and stored on k nodes. Local count x_ij: frequency of item i on node j. Global count y_i = sum_j x_ij. Goal: estimate y_i with additive error εn for all i. We can't hope for relative error for all y_i, but heavy hitters are estimated well. [Huang, Yi, Liu, Chen, INFOCOM '11]

29 Frequency Estimation: Standard Solutions. Local heavy hitters: let n_j = sum_i x_ij be the data size at node j; node j sends in all items with frequency >= εn_j; the total error is at most sum_j εn_j = εn; communication cost: O(k/ε). Simple random sampling: a simple random sample of size O(1/ε^2) can be used to estimate the frequency of any item with error εn (an extra log factor for all items). Algorithm: the coordinator first gets n_j for all j, decides how many samples to get from each j, and then gets the samples from the nodes. Communication cost: O(k + 1/ε^2).

30 Importance Sampling. Sample x_ij with probability g(x_ij). Horvitz-Thompson estimator: X_ij = x_ij / g(x_ij) if x_ij is sampled, and 0 otherwise. Estimator for the global count y_i: Y_i = X_i,1 + ... + X_i,k.
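A sketch of the two sides of this scheme for a generic sampling-probability function g (the helper names are made up; g could be the g_1 or g_2 of the next slide):

    import random

    def node_report(local_counts, g):
        # sample item i with prob. g(x_ij); report the HT value x_ij / g(x_ij)
        report = {}
        for item, x in local_counts.items():
            p = min(g(x), 1.0)
            if random.random() < p:
                report[item] = x / p
        return report

    def estimate_global_counts(node_reports):
        # Y_i = X_i,1 + ... + X_i,k, with unreported terms counting as 0
        est = {}
        for rep in node_reports:
            for item, val in rep.items():
                est[item] = est.get(item, 0.0) + val
        return est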

31 Importance Sampling: What is a good g(x)? Natural choice: g_1(x) = √k·x/(εn); more precisely, g_1(x) = min(√k·x/(εn), 1). Can show: Var[Y_i] = O((εn)^2) for any i. Communication cost: O(√k/ε), which is (worst-case) optimal. Interesting discovery: g_2(x) = g_1(x)^2 also has Var[Y_i] = O((εn)^2) for any i, and also has communication cost O(√k/ε) in the worst case, but its cost can be much lower than g_1(x)'s on some inputs.

32 g_2(x) is Instance-Optimal.

33 Median and Quantiles (order statistics). Exact quantiles: F^{-1}(φ) for 0 < φ < 1, where F is the CDF. Approximate version: tolerate any answer between F^{-1}(φ - ε) and F^{-1}(φ + ε).

34 Estimating the Median by Random Sampling. Simple random sampling: an ε-approximation needs a sample of size Θ(1/ε^2). Paired sampling: divide the data into chunks of size s = O(1/ε); sort each chunk; do binary merges down to one chunk, where each merge keeps the odd-positioned or the even-positioned elements of the merged sequence with equal probability. Similar ideas are used in discrepancy methods. This needs O(n log s) time, so how is it useful?
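A sketch of paired sampling in Python (assuming, for simplicity, that the number of chunks is a power of two):

    import random
    from heapq import merge

    def paired_merge(a, b):
        # merge two sorted chunks, keep the odd- or even-positioned
        # elements of the merged sequence with equal probability
        merged = list(merge(a, b))
        return merged[random.randint(0, 1)::2]

    def paired_sample(data, s):
        chunks = [sorted(data[i:i + s]) for i in range(0, len(data), s)]
        while len(chunks) > 1:
            chunks = [paired_merge(chunks[i], chunks[i + 1])
                      for i in range(0, len(chunks), 2)]
        return chunks[0]   # one sorted chunk of size s

The element of rank roughly φ·s in the returned chunk then estimates the φ-quantile.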

35 Application 1: Streaming Computation. We can merge chunks as items arrive in the stream, keeping at most O(log n) chunks at any time. Space: O((1/ε) log n); can be improved to O((1/ε) log(1/ε)) by combining with random sampling, and can find all quantiles [Felber, Ostrovsky '15]. For comparison, reservoir sampling needs O(1/ε^2) space, and the best deterministic algorithm needs O((1/ε) log n) space [Greenwald, Khanna '01]. [Wang, Luo, Yi, Cormode, SIGMOD '13]

36 Application 2: Distributed Data. Data stored on k nodes. Each node reduces its data to O(1/(ε√k)) items using paired sampling and sends them to the coordinator; the coordinator reduces all the data received to size O(1/ε), again using paired sampling. Communication cost: O(√k/ε). Looks familiar?

37 Generalization: ε-approximations. A sample that preserves the density of point sets: for any range (e.g., a circle), |fraction of sample points in the range - fraction of all points in the range| <= ε. A simple random sample needs size O(1/ε^2); paired sampling yields size O(1/ε^{2d/(d+1)}).

38 ε-approximations on distributed data [Huang, Yi, FOCS '14]

39 Database Workloads. Transactional (OLTP): e.g., deduct x dollars from account A, credit x dollars to account B; the challenge is efficiency and correctness (ACID). Analytical (OLAP): touches a large fraction of the data, many tables, complex conditions; the challenge is efficiency; and correctness? Next: Wander Join: Online Aggregation via Random Walks.

40 Complex Analytical Queries (TPC-H).
SELECT SUM(l_extendedprice * (1 - l_discount))
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' AND c_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
This query finds the total revenue loss due to returned orders in a given region.

41 Online Aggregation [Haas, Hellerstein, Wang, SIGMOD '97].
SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
FROM customer, lineitem, orders, nation, region
WHERE c_custkey = o_custkey AND l_orderkey = o_orderkey AND l_returnflag = 'R' AND c_nationkey = n_nationkey AND n_regionkey = r_regionkey AND r_name = 'ASIA'
WITHTIME CONFIDENCE 95 REPORTINTERVAL 1000
The query continuously reports an estimate Ŷ with an error bound ε (figure: Ŷ - ε, Ŷ, Ŷ + ε). Confidence interval: Pr(Ŷ - ε < Y < Ŷ + ε) > 95%, where Y is the true answer.
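For intuition, a textbook CLT-based interval computed from i.i.d. per-sample estimators is sketched below; the actual online aggregation systems use more carefully derived estimators, so treat this only as an illustration:

    from statistics import mean, stdev

    def estimate_with_ci95(estimates):
        # estimates: i.i.d. unbiased estimators of the query answer
        n = len(estimates)
        half_width = 1.96 * stdev(estimates) / n ** 0.5   # the reported epsilon
        return mean(estimates), half_width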

42 Ripple Join [Haas, Hellerstein, SIGMOD '99]. Store the tuples of each table in random order. In each step, read the next tuple from a table in a round-robin fashion and join it with the sampled tuples from the other tables. Works well for a full Cartesian product, but most joins are sparse.

43 A Running Example (figure: three tables with example rows, Nation(Nation, CID), Orders(BuyerID, OrderID), Items(OrderID, ItemID, Price)). Question: what's the total revenue of all orders from customers in China? N: size of each table, e.g., 10^9. n: # tuples taken from each table. s: # estimators; with n random tuples from each table, the expected number of sampled join results is s = n^3 / N^2, i.e., n = N^{2/3} s^{1/3} tuples are needed for s estimators.

44-45 Join as a Graph (figures). Conceptual only, never materialized: tables R1, R2, R3 drawn as layers of a graph, with edges between joining tuples.

46 Join as a Graph (figure: the example Nation, Orders, and Items tables drawn as a graph, with an edge between every pair of joining tuples).

47-49 Sampling by Random Walks (figures). A random walk starts from a tuple in the first table, then repeatedly moves to a uniformly random joining tuple in the next table, until it either completes a join result or hits a dead end.

50 Sampling by Random Walks (figure). N: size of each table, e.g., 10^9. n: # tuples taken from each table = # random walks. s: # estimators; here n = s. Unbiased estimator: (value of the sampled join result) / (sampling prob.), where the sampling prob. is the product of 1/(# joining neighbors) at each step of the walk, e.g., 1/3 x 1/4 x ...
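A sketch of one wander-join random walk and its unbiased estimator for a chain join R1 -> R2 -> ... (the adjacency maps stand in for the index lookups; names are illustrative):

    import random

    def wander_join_walk(start_tuples, neighbors, value):
        # start_tuples: tuples of the first table satisfying the predicate
        # neighbors[level][t]: tuples of the next table joining with t
        # value(path): the aggregated expression of a complete join result
        t = random.choice(start_tuples)
        prob = 1.0 / len(start_tuples)
        path = [t]
        for level_map in neighbors:
            nbrs = level_map.get(t, [])
            if not nbrs:               # dead end: walk fails; counting it as 0
                return 0.0             #   keeps the estimator unbiased
            t = random.choice(nbrs)
            prob *= 1.0 / len(nbrs)
            path.append(t)
        return value(path) / prob      # Horvitz-Thompson estimator

Averaging many such walks gives the running estimate; each walk only touches one tuple per table, which is why an index on the join attributes is needed.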

51 Walk Plan Optimization. Depends on the structure of the data graph and on the selection predicates: for the starting table, use an index; for a table in the middle, reject the random walk. Also depends on the data distribution, and non-uniformity may not be a bad thing! (Figure: two data graphs on R1, R2, R3; in one, Var(walking R1 -> R2) < Var(walking R2 -> R1), in the other the inequality is reversed.)

52 Walk Plan Optimizer. Enumerate all plans; conduct ~100 trial random walks using each plan; measure the variance of each plan; select the best plan. All trial runs are still useful.
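A sketch of this optimizer, where each candidate plan is represented as a zero-argument function that performs one random walk (e.g., wander_join_walk above with the tables in that order) and returns its estimator:

    from statistics import pvariance

    def choose_plan(plans, trials=100):
        # run ~100 trial walks per plan and pick the plan with the smallest
        # empirical variance; the trial estimators themselves are not wasted
        estimates = {name: [walk() for _ in range(trials)]
                     for name, walk in plans.items()}
        best = min(estimates, key=lambda name: pvariance(estimates[name]))
        return best, estimates[best]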

53 Convergence Comparison (figure). [Li, Wu, Yi, Zhao, SIGMOD '16 Best Paper Award]

54 Wander Join in PostgreSQL (figure). Logarithmic growth, due to the B-tree lookups used to find random neighbours.

55 Running on Insufficient Memory (4GB). Insufficient memory incurs a heavy, one-time penalty; the growth is still logarithmic. Fundamentally, random sampling is at odds with hard disks, but does it matter? Spark, in-memory DBs, RAMCloud; and the algorithm is embarrassingly parallel. Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB '09].

56 Accuracy Achieved in 1/10 the Time of a Full Join (figure).

57 Wander Join vs Ripple Join.
Sampling methodology: wander join: independent but non-uniform; ripple join: uniform but not independent.
Index needed? wander join: yes; ripple join: an index or random storage order.
Confidence interval computation: wander join: easy, O(n) time; ripple join: complicated, O(n^k) time (k: # tables).
Convergence time (20GB data, 3 tables): wander join: ~3 s; ripple join: ~50 s.
Scalability: wander join: logarithmic; ripple join: slightly less than linear.
System implementation: wander join: PostgreSQL (finished), Oracle (in progress), SparkSQL (in progress); ripple join: Informix (internal project), DBO.

58 Online Aggregation vs Data Cube.
Queries: online aggregation: online, ad hoc; data cube: offline, fixed.
Latency: seconds vs hours (then milliseconds).
Query mode: one at a time vs batch.
Accuracy: small error vs no error.
Data schema: any (relational, graph) vs multidimensional cube.
Work with OLTP: integrated vs separate.
Target scenario: online, ad hoc, interactive data analytics vs monthly reports.

59 Thank you!

60 Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD '90] (figure: the example Nation, Orders, and Items tables).

61 Sampling from a B-tree [Olken '93]. Sampling from an aggregate (ranked) B-tree is easy, but it incurs a heavy cost for transactions and needs modifications to existing B-tree implementations.

62 Rejection Sampling [Olken '93]. Imagine each node has the maximum fanout; reject the walk as soon as it goes out of bounds.

63 Non-Uniform Sampling. As long as we can compute the sampling probability, wander join still works!

64 Compare with BlinkDB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, '13].
Methodology: wander join: query, then sampling; BlinkDB: sampling, then query.
Sampling method: random walks vs stratified sampling.
Joins supported: any vs a big table joining a small table (no sampling on the small table).
Error: reduces over time vs fixed.
Data schema: any (relational, graph) vs star / snowflake.
Work with OLTP: integrated vs separate.
Group-by support: unbalanced vs balanced.
