Random Sampling on Big Data: Techniques and Applications Ke Yi
1 Random Sampling on Big Data: Techniques and Applications
Ke Yi, Hong Kong University of Science and Technology
2 Big Data in one slide
- The 3 V's: Volume, Velocity, Variety
- Integers, real numbers
- Points in a multi-dimensional space
- Records in a relational database
- Graph-structured data
3 Dealing with Big Data
- The first approach: scale up / out the computation
- Many great technical innovations:
  - Distributed/parallel systems
  - Simpler programming models
    - MapReduce, Pregel, Dremel, Spark
    - BSP
  - Failure tolerance and recovery
  - Drop certain features: ACID, CAP, NoSQL
- This talk is not about this approach!
4 Downsizing data
- A second approach to computational scalability: scale down the data!
  - A compact representation of a large data set
  - Too much redundancy in big data anyway
  - What we finally want is small: human-readable analysis / decisions
  - Necessarily gives up some accuracy: approximate answers
  - Examples: samples, sketches, histograms, various transforms
  - See the tutorial by Graham Cormode for other data summaries
- Complementary to the first approach
  - Can scale out computation and scale down data at the same time
  - Algorithms need to work under new system architectures
  - Good old RAM model no longer applies
5 Outline for the talk
- Simple random sampling
  - Sampling from a data stream
  - Sampling from distributed streams
  - Sampling for range queries
- Not-so-simple sampling
  - Importance sampling: frequency estimation on distributed data
  - Paired sampling: medians and quantiles
  - Random walk sampling: SQL queries (joins)
- Will jump back and forth between theory and practice
6 Simple Random Sampling
- Sampling without replacement
  - Randomly draw an element
  - Don't put it back
  - Repeat s times
- Sampling with replacement
  - Randomly draw an element
  - Put it back
  - Repeat s times
- Trivial in the RAM model
- The statistical difference is very small for $n \gg s$
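The two procedures on this slide can be sketched in a few lines of Python. This is only an illustration in the RAM model; the function names and the `rng` parameter are assumptions made for the example, not something the slides define.

```python
import random

def sample_without_replacement(items, s, rng=random):
    """Draw an element, don't put it back, repeat s times."""
    pool = list(items)
    sample = []
    for _ in range(s):
        idx = rng.randrange(len(pool))
        # Move the chosen element to the end and remove it in O(1).
        pool[idx], pool[-1] = pool[-1], pool[idx]
        sample.append(pool.pop())
    return sample

def sample_with_replacement(items, s, rng=random):
    """Draw an element, put it back, repeat s times."""
    pool = list(items)
    return [pool[rng.randrange(len(pool))] for _ in range(s)]
```

Sampling without replacement never returns the same position twice, whereas sampling with replacement may; when n is much larger than s, repeats are rare, which is why the two behave almost identically.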
7 Random Sampling from a Data Stream
- A stream of elements coming in at high speed
- Limited memory
- Need to maintain the sample continuously
- Applications
  - Data stored on disk
  - Network traffic
9 Reservoir Sampling
- Maintain a sample of size s drawn (without replacement) from all elements in the stream so far
- Keep the first s elements in the stream, set $n \leftarrow s$
- Algorithm for a new element
  - $n \leftarrow n + 1$
  - With probability $s/n$, use it to replace an item in the current sample chosen uniformly at random
  - With probability $1 - s/n$, throw it away
- Perhaps the first "streaming" algorithm [Waterman ??; Knuth's book]
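The steps on this slide translate almost directly into code. The following is a minimal sketch, assuming the stream is any Python iterable; the function name and the optional `rng` argument are illustrative choices.

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Maintain a size-s sample (without replacement) of the stream seen so far."""
    rng = rng or random.Random()
    sample = []
    n = 0  # number of elements seen so far
    for item in stream:
        n += 1
        if n <= s:
            # Keep the first s elements.
            sample.append(item)
        elif rng.random() < s / n:
            # With probability s/n, replace a uniformly chosen sample item.
            sample[rng.randrange(s)] = item
        # Otherwise (probability 1 - s/n) the element is thrown away.
    return sample
```

The memory used is O(s) regardless of the stream length, and the sample is valid at any point during the stream, not only at the end.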
10 Correctness Proof
- By induction on n
  - n = s: trivially correct
  - Assume each element so far is sampled with probability $s/n$
  - Consider n + 1:
    - The new element is sampled with probability $\frac{s}{n+1}$
    - Any element in the current sample is sampled with probability $\frac{s}{n}\left(1 - \frac{s}{n+1} + \frac{s}{n+1}\cdot\frac{s-1}{s}\right) = \frac{s}{n+1}$. Yeah!
- This is a wrong (incomplete) proof
  - Each element being sampled with probability $s/n$ is not a sufficient condition of random sampling
  - Counterexample: divide elements into groups of s and pick one group randomly
12 Reservoir Sampling Correctness Proof
- Many "proofs" found online are actually wrong
  - They only show that each item is sampled with probability s/n
  - Need to show that every subset of size s has the same probability to be the sample
- The correct proof relates reservoir sampling to the Fisher-Yates shuffle
- [Figure: Fisher-Yates shuffle illustrated on the elements a, b, c, d with s = 2]
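The gap between "each item is sampled with probability s/n" and "every subset of size s is equally likely" can be checked exactly on a tiny instance. The sketch below computes the distribution of the group-based counterexample for n = 4, s = 2; the helper names are made up for this illustration.

```python
from fractions import Fraction
from itertools import combinations

def group_scheme_distribution(n, s):
    """Exact sample distribution when elements 0..n-1 are split into
    consecutive groups of size s and one group is picked uniformly."""
    assert n % s == 0
    groups = [tuple(range(g, g + s)) for g in range(0, n, s)]
    return {g: Fraction(1, len(groups)) for g in groups}

def inclusion_probabilities(dist, n):
    """Probability that each element appears in the chosen sample."""
    return [sum(p for subset, p in dist.items() if e in subset)
            for e in range(n)]

n, s = 4, 2
dist = group_scheme_distribution(n, s)
incl = inclusion_probabilities(dist, n)
all_subsets = list(combinations(range(n), s))
unreachable = [sub for sub in all_subsets if sub not in dist]
```

Here every element has inclusion probability exactly s/n = 1/2, yet only 2 of the 6 possible subsets can ever be the sample, so the scheme is not a simple random sample.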
13 Sampling from Distributed Streams
- One coordinator and k sites
- Each site can communicate with the coordinator
- Goal: maintain a random sample of size s over the union of all streams with minimum communication
- Difficulty: don't know n, so can't run the reservoir sampling algorithm
- Key observation: don't have to know n in order to sample!
[Cormode, Muthukrishnan, Yi, Zhang, PODS'10, JACM'12] [Woodruff, Tirthapura, DISC'11]
14 Reduction from Coin Flip Sampling
- Flip a fair coin for each element until we get 1
- An element is active on a level if its coin flip on that level is 0
- If a level has at least s active elements, we can draw a sample from those active elements
- Key: the coordinator does not want all the active elements, which are too many!
  - Choose a level appropriately
15 The Algorithm
- Initialize $i \leftarrow 0$
- In round i:
  - Sites send in every item w.p. $2^{-i}$ (this is a coin-flip sample with prob. $2^{-i}$)
  - Coordinator maintains a lower sample and a higher sample: each received item goes to either with equal prob. (the lower sample is a sample with prob. $2^{-(i+1)}$)
  - When the lower sample reaches size s, the coordinator broadcasts to advance to round $i \leftarrow i + 1$
    - Discard the higher sample
    - Split the lower sample into a new lower sample and a new higher sample
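The round structure can be simulated in a single process. The sketch below is a simplified model, not the authors' implementation: the class and function names are invented, the broadcast is modelled by sites reading `coordinator.round` directly, and a broadcast is counted as k messages. The final sample is drawn uniformly from the union of the lower and higher samples, which together form a coin-flip sample with probability $2^{-i}$.

```python
import random

class Coordinator:
    """Coordinator side of the round-based protocol (single-process simulation)."""

    def __init__(self, s, k, rng):
        self.s, self.k, self.rng = s, k, rng
        self.round = 0            # sites forward each item w.p. 2**-round
        self.lower, self.higher = [], []
        self.messages = 0         # items received plus broadcast messages

    def receive(self, item):
        self.messages += 1
        target = self.lower if self.rng.random() < 0.5 else self.higher
        target.append(item)
        while len(self.lower) >= self.s:
            self._advance_round()

    def _advance_round(self):
        self.round += 1
        self.messages += self.k   # broadcast the new round to all k sites
        # Discard the higher sample, then split the lower sample in two.
        old_lower, self.lower, self.higher = self.lower, [], []
        for item in old_lower:
            target = self.lower if self.rng.random() < 0.5 else self.higher
            target.append(item)

    def sample(self):
        """Draw s items uniformly from the current coin-flip sample."""
        pool = self.lower + self.higher
        return self.rng.sample(pool, min(self.s, len(pool)))


def site_observe(item, coordinator, rng):
    """Site side: forward the item with probability 2**-round."""
    if rng.random() < 2.0 ** -coordinator.round:
        coordinator.receive(item)
```

In a run with, say, 4 sites and 10,000 items, one would call `site_observe` for each arriving item and `coordinator.sample()` whenever a sample is needed; the number of messages stays far below the number of items because the forwarding probability halves every round.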
16 Communication Cost of Algorithm
- Communication cost of each round: $O(k + s)$
  - Expect to receive $O(s)$ sampled items before the round ends
  - Broadcast to end the round: $O(k)$
- Number of rounds: $O(\log n)$
  - In each round, need $\Theta(s)$ items being sampled to end the round
  - Each item has prob. $2^{-i}$ to contribute: need $\Theta(2^i s)$ items
- Total communication: $O((k + s) \log n)$
  - Can be improved to $O(k \log_{k/s} n + s \log n)$
  - A matching lower bound
- Sliding windows
17 Random Sampling for Range Queries
[Christensen, Wang, Li, Yi, Tang, Villa, SIGMOD'15 Best Demo Award]
18 Online Range Sampling
- Problem definition: preprocess a set of points in the plane, so that for any range query, we can return samples (with or without replacement) drawn from all points in the range until user termination.
- Parameters:
  - n: data size
  - q: query size
  - s: sample size (not known beforehand)
  - $n \gg q \gg s$
- Naïve solutions:
  - Query then sample: $O(f(n) + q)$
  - Sample then query: $O(sn/q)$ (store data in random order)
- New solution: $O(f(sn/q) + s)$
  - $f(x)$: # canonical nodes in a tree of size x, between log x and x
[Wang, Christensen, Li, Yi, VLDB'16]
19 Indexing Spatial Data
- Numerous spatial indexing structures in the literature
- R-tree
20 RS-tree
- Attach a sample to each node u, drawn from the leaves below u
- Total space: O(n)
- Construction time: O(n)
21 RS-tree: A 1D Example Report: 5 5 Active nodes
22 RS-tree: A 1D Example Report: 5 5 Active nodes
23 RS-tree: A 1D Example Report: Active nodes Pick 7 or 14 with equal prob
24 RS-tree: A 1D Example Report: Active nodes Pick 3, 8, or 14 with prob. 1:1:
25 RS-tree: A 1D Example Report: Active nodes
26 RS-tree: A 1D Example Report: Active nodes Pick 3, 8, or 12 with equal prob
27 Not-So-Simple Random Sampling When simple random sampling is not optimal/feasible
28 Frequency Estimation on Distributed Data
- Given: a multiset S of n items drawn from the universe [u]
  - For example: IP addresses of network packets
- S is partitioned arbitrarily and stored on k nodes
  - Local count $x_{ij}$: frequency of item i on node j
  - Global count $y_i = \sum_j x_{ij}$
- Goal: estimate $y_i$ with additive error $\varepsilon n$ for all i
  - Can't hope for relative error for all $y_i$
  - Heavy hitters are estimated well
[Huang, Yi, Liu, Chen, INFOCOM'11]
29 Frequency Estimation: Standard Solutions
- Local heavy hitters
  - Let $n_j = \sum_i x_{ij}$ be the data size at node j
  - Node j sends in all items with frequency $\ge \varepsilon n_j$
  - Total error is at most $\sum_j \varepsilon n_j = \varepsilon n$
  - Communication cost: $O(k/\varepsilon)$
- Simple random sampling
  - A simple random sample of size $O(1/\varepsilon^2)$ can be used to estimate the frequency of any item with error $\varepsilon n$
  - Extra log factor for all items
  - Algorithm
    - Coordinator first gets $n_j$ for all j
    - Decides how many samples to get from each j
    - Gets the samples from the nodes
  - Communication cost: $O(k + 1/\varepsilon^2)$
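30 Importance Sampling
- Horvitz-Thompson estimator, where $g(x_{ij})$ is the probability with which the local count $x_{ij}$ is sampled:
  $X_{ij} = \begin{cases} x_{ij} / g(x_{ij}) & \text{if } x_{ij} \text{ is sampled} \\ 0 & \text{otherwise} \end{cases}$
- Estimator for global count $y_i$: $Y_i = X_{i,1} + \cdots + X_{i,k}$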
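31 Importance Sampling: What is a good g(x)?
- Natural choice: $g_1(x) = \frac{\sqrt{k}}{\varepsilon n} x$
  - More precisely: $g_1(x) = \min\left(\frac{\sqrt{k}}{\varepsilon n} x,\ 1\right)$, since a sampling probability cannot exceed 1
  - Can show: $\mathrm{Var}[Y_i] = O((\varepsilon n)^2)$ for any i
  - Communication cost: $O(\sqrt{k}/\varepsilon)$
  - This is (worst-case) optimal
- Interesting discovery: $g_2(x) = g_1(x)^2$
  - Also has $\mathrm{Var}[Y_i] = O((\varepsilon n)^2)$ for any i
  - Also has communication cost $O(\sqrt{k}/\varepsilon)$ in the worst case
  - But can be much lower than $g_1(x)$ on some inputs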
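A small simulation shows how the Horvitz-Thompson estimator and the two sampling functions fit together. This is a sketch under stated assumptions: the data layout (one dictionary of local counts per node), the function names, and the fact that every node already knows the global size n are all choices made for the example.

```python
import math
import random

def g1(x, k, eps, n):
    """Linear sampling function, capped at 1 so it is a valid probability."""
    return min(math.sqrt(k) * x / (eps * n), 1.0)

def g2(x, k, eps, n):
    """Quadratic sampling function g2(x) = g1(x)^2."""
    return g1(x, k, eps, n) ** 2

def estimate_global_counts(nodes, eps, g, rng):
    """nodes: list of dicts mapping item -> local count x_ij (one dict per node).
    Returns (estimates Y_i, number of local counts sent to the coordinator)."""
    k = len(nodes)
    n = sum(sum(counts.values()) for counts in nodes)
    estimates = {}
    messages = 0
    for counts in nodes:
        for item, x in counts.items():
            p = g(x, k, eps, n)
            if rng.random() < p:
                # Sampled: contribute x_ij / g(x_ij); unsampled counts add 0.
                messages += 1
                estimates[item] = estimates.get(item, 0.0) + x / p
    return estimates, messages
```

Because $g_2(x) \le g_1(x)$ whenever $g_1(x) \le 1$, the quadratic function never sends more counts in expectation than the linear one, which matches the claim that it can be much cheaper on some inputs.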
32 $g_2(x)$ is Instance-Optimal
33 Median and Quantiles (order statistics)
- Exact quantiles: $F^{-1}(\phi)$ for $0 < \phi < 1$, F: CDF
- Approximate version: tolerate any answer between $F^{-1}(\phi - \varepsilon)$ and $F^{-1}(\phi + \varepsilon)$
34 Estimating Median by Random Sampling
- Simple random sampling
  - An ε-approximation needs a sample of size $\Theta(1/\varepsilon^2)$
- Paired sampling
  - Divide data into chunks of size $s = O(1/\varepsilon)$
  - Sort each chunk
  - Do binary merges into one chunk
  - Each merge takes odd-positioned or even-positioned elements with equal probability
  - Similar ideas used in discrepancy methods
- This needs $O(n \log s)$ time; how is it useful?
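The merge step is easy to get subtly wrong, so a short sketch may help. It assumes, purely for simplicity, that the input length is the chunk size s times a power of two; the function names and the returned weight (how many original elements each surviving element stands for) are illustrative.

```python
import bisect
import random

def paired_merge(a, b, rng):
    """Merge two sorted chunks of equal size and keep either the odd- or the
    even-positioned elements, each choice with probability 1/2."""
    merged = sorted(a + b)
    offset = rng.randrange(2)
    return merged[offset::2]

def paired_sample(data, s, rng):
    """Reduce data to one sorted chunk of size s by repeated binary merges.
    Returns (chunk, weight)."""
    chunks = [sorted(data[i:i + s]) for i in range(0, len(data), s)]
    if len(data) % s != 0 or len(chunks) & (len(chunks) - 1):
        raise ValueError("this sketch needs len(data) = s * 2^h")
    weight = 1
    while len(chunks) > 1:
        chunks = [paired_merge(chunks[i], chunks[i + 1], rng)
                  for i in range(0, len(chunks), 2)]
        weight *= 2
    return chunks[0], weight

def estimate_rank(summary, weight, x):
    """Estimated number of original elements that are <= x."""
    return weight * bisect.bisect_right(summary, x)
```

Each merge changes the estimated rank of any value by at most the current weight, and the random odd/even choice makes these errors cancel in expectation, which is where the improvement over simple random sampling comes from.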
35 Application 1: Streaming Computation
- Can merge chunks up as items arrive in the stream
- At any time, keep at most $O(\log n)$ chunks
- Space: $O((1/\varepsilon) \log n)$
  - Can be improved to $O((1/\varepsilon) \log(1/\varepsilon))$ by combining with random sampling
  - Can find all quantiles [Felber, Ostrovsky, '15]
- Reservoir sampling needs $O(1/\varepsilon^2)$ space
- Best deterministic algorithm needs $O((1/\varepsilon) \log n)$ space [Greenwald, Khanna, '01]
[Wang, Luo, Yi, Cormode, SIGMOD'13]
36 Application 2: Distributed Data
- Data stored on k nodes
- Each node reduces its data to size $O(1/(\varepsilon\sqrt{k}))$ using paired sampling, and sends it to the coordinator
- The coordinator reduces all the data received to a size of $O(1/\varepsilon)$ using paired sampling
- Communication cost: $O(\sqrt{k}/\varepsilon)$
  - Looks familiar?
37 Generalization: ε-approximations
- A sample that preserves the density of point sets
  - For any range (e.g., a circle), $|\text{fraction of sample points} - \text{fraction of all points}| \le \varepsilon$
- Simple random sample needs size $O(1/\varepsilon^2)$
- Paired sampling yields size $O(1/\varepsilon^{2d/(d+1)})$
38 ε-approximations on distributed data
[Huang, Yi, FOCS'14]
39 Database Workloads
- Transactional (OLTP)
  - Deduct x dollars from account A, credit x dollars to account B
  - Challenge: efficiency and correctness (ACID)
- Analytical (OLAP)
  - Large fraction of data
  - Many tables
  - Complex conditions
  - Challenge: efficiency
  - Correctness?
Wander Join: Online Aggregation via Random Walks
40 Complex Analytical Queries (TPC-H)
    SELECT SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
- This query finds the total revenue loss due to returned orders in a given region.
41 Online Aggregation [Haas, Hellerstein, Wang, SIGMOD'97]
    SELECT ONLINE SUM(l_extendedprice * (1 - l_discount))
    FROM customer, lineitem, orders, nation, region
    WHERE c_custkey = o_custkey
      AND l_orderkey = o_orderkey
      AND l_returnflag = 'R'
      AND c_nationkey = n_nationkey
      AND n_regionkey = r_regionkey
      AND r_name = 'ASIA'
    WITHTIME CONFIDENCE 95 REPORTINTERVAL 1000
- [Figure: an estimate with its confidence interval]
- Confidence interval: $\Pr[\tilde{Y} - \varepsilon < Y < \tilde{Y} + \varepsilon] > 0.95$, where $\tilde{Y}$ is the current estimate of the true answer $Y$
42 Ripple Join [Haas, Hellerstein, SIGMOD'99]
- Store tuples in each table in random order
- In each step
  - Read the next tuple from a table in a round-robin fashion
  - Join with sampled tuples from other tables
- Works well for a full Cartesian product
  - But most joins are sparse
43 A Running Example
- Three tables, with columns (Nation, CID), (BuyerID, OrderID), and (OrderID, ItemID, Price)
- What's the total revenue of all orders from customers in China?
- N: size of each table, e.g., $10^9$
- n: # tuples taken from each table
- s: # estimators; $n \cdot (n/N)^2 = s$, so $n = N^{2/3} s^{1/3}$
44 Join as a Graph
- Conceptual only
- Never materialized
- [Figure: join graph over tables $R_1$, $R_2$, $R_3$]
46 Join as a Graph
- Running example, first table (Nation, CID): (US, 1), (US, 2), (China, 3), (UK, 4), (China, 5), (US, 6), (China, 7), (UK, 8), (Japan, 9), (UK, 10)
- [Figure: join graph linking these tuples to the tables (BuyerID, OrderID) and (OrderID, ItemID, Price)]
47-49 Sampling by Random Walks
- [Figure: a random walk through the join graph of the running example, from the (Nation, CID) table through (BuyerID, OrderID) to (OrderID, ItemID, Price)]
50 Sampling by Random Walks
- N: size of each table
- n: # tuples taken from each table = # random walks
- s: # estimators; $n = s = 10^3$
- Unbiased estimator: $500 / (sampling prob.), where the sampling probability of the walk is the product of the per-step probabilities, 1/3 · 1/4 · 1/…
- [Figure: one random walk through the running example tables ending at an item priced $500]
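The estimator on this slide can be sketched as follows. The table layouts mirror the running example's columns, but the function name, the in-memory dictionaries standing in for indexes, and the treatment of a failed walk as contributing 0 are assumptions of this sketch rather than details taken from the slides.

```python
import random
from collections import defaultdict

def wander_join_sum(customers, orders, items, nation, num_walks, rng):
    """Estimate the total price of all items ordered by customers in `nation`.
    customers: (nation, cid); orders: (buyer_id, order_id);
    items: (order_id, item_id, price)."""
    # Hash maps standing in for the indexes used to find random neighbours.
    by_nation = defaultdict(list)
    for nat, cid in customers:
        by_nation[nat].append(cid)
    orders_of = defaultdict(list)
    for buyer_id, order_id in orders:
        orders_of[buyer_id].append(order_id)
    prices_of = defaultdict(list)
    for order_id, _item_id, price in items:
        prices_of[order_id].append(price)

    start = by_nation.get(nation, [])
    if not start:
        return 0.0
    total = 0.0
    for _ in range(num_walks):
        cid = rng.choice(start)
        prob = 1.0 / len(start)
        cust_orders = orders_of.get(cid, [])
        if not cust_orders:
            continue  # failed walk: contributes 0 to the average
        oid = rng.choice(cust_orders)
        prob /= len(cust_orders)
        prices = prices_of.get(oid, [])
        if not prices:
            continue
        price = rng.choice(prices)
        prob /= len(prices)
        # Horvitz-Thompson style estimate: value divided by the walk's probability.
        total += price / prob
    return total / num_walks
```

Each walk is independent of the others, so the average converges to the true sum and a confidence interval can be computed from the sample variance of the per-walk estimates.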
51 Walk Plan Optimization
- Structure of the data graph
- Selection predicates
  - Starting table: use index
  - Table in the middle: reject random walk
- Data distribution
  - Non-uniformity may not be a bad thing!
- [Figure: two example join graphs over $R_1$, $R_2$, $R_3$; in one, $\mathrm{Var}[R_1 \to R_2] < \mathrm{Var}[R_2 \to R_1]$; in the other, $\mathrm{Var}[R_1 \to R_2] > \mathrm{Var}[R_2 \to R_1]$]
52 Walk Plan Optimizer
- Enumerate all plans
- Conduct ~100 trial random walks using each plan
- Measure the variance of each plan
- Select the best plan
- All trial runs are still useful
53 Convergence Comparison
Wander Join: Online Aggregation via Random Walks [Li, Wu, Yi, Zhao, SIGMOD'16 Best Paper Award]
54 Wander Join in PostgreSQL
- Logarithmic growth due to B-tree lookup to find random neighbours
55 Running on Insufficient Memory (4GB)
- Insufficient memory incurs a heavy, one-time penalty
- Growth is still logarithmic
- Fundamentally: random sampling is at odds with hard disks
- But does it matter?
  - Spark, in-memory DB, RAM cloud
  - The algorithm is embarrassingly parallel
- Turbo DBO [Dobra, Jermaine, Rusu, Xu, VLDB'09]
56 Accuracy Achieved in 1/10 Time of Full Join
57 Wander Join vs Ripple Join
Aspect | Wander Join | Ripple Join
Sampling methodology | Independent but non-uniform | Uniform but non-independent
Index needed? | Yes | Index or random storage
Confidence interval computation | Easy, $O(n)$ time | Complicated, $O(n^k)$ time (k: # tables)
Convergence time (20GB data, 3 tables) | ~3s | ~50s
Scalability | Logarithmic | Slightly less than linear
System implementation | PostgreSQL (finished), Oracle (in progress), SparkSQL (in progress) | Informix (internal project), DBO
58 Online Aggregation vs Data Cube
Aspect | Online Aggregation | Data Cube
Queries | Online, ad hoc | Offline, fixed
Latency | Seconds | Hours, then milliseconds
Query mode | One at a time | Batch
Accuracy | Small error | No error
Data schema | Any (relational, graph) | Multidimensional cube
Work with OLTP | Integrated | Separate
Target scenario | Online, ad hoc, interactive data analytics | Monthly report
59 Thank you!
60 Index Ripple Join [Lipton, Naughton, Schneider, SIGMOD'90]
- [Figure: the running example tables (Nation, CID), (BuyerID, OrderID), and (OrderID, ItemID, Price)]
61 Sampling from a B-tree [Olken, '93]
- Sampling from an aggregate (ranked) B-tree is easy
- But it incurs a heavy cost for transactions
- Need to modify existing B-tree implementations
62 Rejection Sampling [Olken, '93]
- Imagine each node has maximum fanout
- Reject as soon as it walks out of bound
63 Non-Uniform Sampling
- As long as we can compute the sampling probability, wander join still works!
64 Compare with BlinkDB [Agarwal, Mozafari, Panda, Milner, Madden, Stoica, '13]
Aspect | Wander Join | BlinkDB
Methodology | Query, then sampling | Sampling, then query
Sampling method | Random walks | Stratified sampling
Joins supported | Any | Big table joining a small table (no sampling on small table)
Error | Reduces over time | Fixed
Data schema | Any (relational, graph) | Star / snowflake
Work with OLTP | Integrated | Separate
Group-by support | Unbalanced | Balanced
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationCS361 Homework #3 Solutions
CS6 Homework # Solutions. Suppose I have a hash table with 5 locations. I would like to know how many items I can store in it before it becomes fairly likely that I have a collision, i.e., that two items
More informationBucket-Sort. Have seen lower bound of Ω(nlog n) for comparisonbased. Some cheating algorithms achieve O(n), given certain assumptions re input
Bucket-Sort Have seen lower bound of Ω(nlog n) for comparisonbased sorting algs Some cheating algorithms achieve O(n), given certain assumptions re input One example: bucket sort Assumption: input numbers
More informationVerifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University
Verifying Computations in the Cloud (and Elsewhere) Michael Mitzenmacher, Harvard University Work offloaded to Justin Thaler, Harvard University Goals of Verifiable Computation Provide user with correctness
More informationInitial Sampling for Automatic Interactive Data Exploration
Initial Sampling for Automatic Interactive Data Exploration Wenzhao Liu 1, Yanlei Diao 1, and Anna Liu 2 1 College of Information and Computer Sciences, University of Massachusetts, Amherst 2 Department
More informationJeffrey D. Ullman Stanford University
Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some
More informationApproximate Counts and Quantiles over Sliding Windows
Approximate Counts and Quantiles over Sliding Windows Arvind Arasu Stanford University arvinda@cs.stanford.edu Gurmeet Singh Manku Stanford University manku@cs.stanford.edu Abstract We consider the problem
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More informationSimulation. Where real stuff starts
Simulation Where real stuff starts March 2019 1 ToC 1. What is a simulation? 2. Accuracy of output 3. Random Number Generators 4. How to sample 5. Monte Carlo 6. Bootstrap 2 1. What is a simulation? 3
More informationTopics in Probabilistic and Statistical Databases. Lecture 9: Histograms and Sampling. Dan Suciu University of Washington
Topics in Probabilistic and Statistical Databases Lecture 9: Histograms and Sampling Dan Suciu University of Washington 1 References Fast Algorithms For Hierarchical Range Histogram Construction, Guha,
More informationQuick Sort Notes , Spring 2010
Quick Sort Notes 18.310, Spring 2010 0.1 Randomized Median Finding In a previous lecture, we discussed the problem of finding the median of a list of m elements, or more generally the element of rank m.
More informationRandomized Algorithms
Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours
More informationGeometric Problems in Moderate Dimensions
Geometric Problems in Moderate Dimensions Timothy Chan UIUC Basic Problems in Comp. Geometry Orthogonal range search preprocess n points in R d s.t. we can detect, or count, or report points inside a query
More informationDivisible Load Scheduling
Divisible Load Scheduling Henri Casanova 1,2 1 Associate Professor Department of Information and Computer Science University of Hawai i at Manoa, U.S.A. 2 Visiting Associate Professor National Institute
More informationErrata and Proofs for Quickr [2]
Errata and Proofs for Quickr [2] Srikanth Kandula 1 Errata We point out some errors in the SIGMOD version of our Quickr [2] paper. The transitivity theorem, in Proposition 1 of Quickr, has a revision in
More informationLecture 3 Sept. 4, 2014
CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More informationLecture 1: Introduction to Sublinear Algorithms
CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for
More informationREVIEW QUESTIONS. Chapter 1: Foundations: Sets, Logic, and Algorithms
REVIEW QUESTIONS Chapter 1: Foundations: Sets, Logic, and Algorithms 1. Why can t a Venn diagram be used to prove a statement about sets? 2. Suppose S is a set with n elements. Explain why the power set
More informationCommunication-Efficient Computation on Distributed Noisy Datasets. Qin Zhang Indiana University Bloomington
Communication-Efficient Computation on Distributed Noisy Datasets Qin Zhang Indiana University Bloomington SPAA 15 June 15, 2015 1-1 Model of computation The coordinator model: k sites and 1 coordinator.
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More informationCOMS E F15. Lecture 22: Linearity Testing Sparse Fourier Transform
COMS E6998-9 F15 Lecture 22: Linearity Testing Sparse Fourier Transform 1 Administrivia, Plan Thu: no class. Happy Thanksgiving! Tue, Dec 1 st : Sergei Vassilvitskii (Google Research) on MapReduce model
More information7 Algorithms for Massive Data Problems
7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random
More informationCS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014
CS 598CSC: Algorithms for Big Data Lecture date: Sept 11, 2014 Instructor: Chandra Cheuri Scribe: Chandra Cheuri The Misra-Greis deterministic counting guarantees that all items with frequency > F 1 /
More informationCS 347 Distributed Databases and Transaction Processing Notes03: Query Processing
CS 347 Distributed Databases and Transaction Processing Notes03: Query Processing Hector Garcia-Molina Zoltan Gyongyi CS 347 Notes 03 1 Query Processing! Decomposition! Localization! Optimization CS 347
More informationAlgorithms for Calculating Statistical Properties on Moving Points
Algorithms for Calculating Statistical Properties on Moving Points Dissertation Proposal Sorelle Friedler Committee: David Mount (Chair), William Gasarch Samir Khuller, Amitabh Varshney January 14, 2009
More informationCS 347. Parallel and Distributed Data Processing. Spring Notes 11: MapReduce
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 11: MapReduce Motivation Distribution makes simple computations complex Communication Load balancing Fault tolerance Not all applications
More informationLecture 59 : Instance Compression and Succinct PCP s for NP
IITM-CS6840: Advanced Complexity Theory March 31, 2012 Lecture 59 : Instance Compression and Succinct PCP s for NP Lecturer: Sivaramakrishnan N.R. Scribe: Prashant Vasudevan 1 Introduction Classical Complexity
More informationLecture 1 September 3, 2013
CS 229r: Algorithms for Big Data Fall 2013 Prof. Jelani Nelson Lecture 1 September 3, 2013 Scribes: Andrew Wang and Andrew Liu 1 Course Logistics The problem sets can be found on the course website: http://people.seas.harvard.edu/~minilek/cs229r/index.html
More informationMaintaining Significant Stream Statistics over Sliding Windows
Maintaining Significant Stream Statistics over Sliding Windows L.K. Lee H.F. Ting Abstract In this paper, we introduce the Significant One Counting problem. Let ε and θ be respectively some user-specified
More informationA Tight Lower Bound for Dynamic Membership in the External Memory Model
A Tight Lower Bound for Dynamic Membership in the External Memory Model Elad Verbin ITCS, Tsinghua University Qin Zhang Hong Kong University of Science & Technology April 2010 1-1 The computational model
More informationLecture 2 September 4, 2014
CS 224: Advanced Algorithms Fall 2014 Prof. Jelani Nelson Lecture 2 September 4, 2014 Scribe: David Liu 1 Overview In the last lecture we introduced the word RAM model and covered veb trees to solve the
More informationCS 395T Computational Learning Theory. Scribe: Mike Halcrow. x 4. x 2. x 6
CS 395T Computational Learning Theory Lecture 3: September 0, 2007 Lecturer: Adam Klivans Scribe: Mike Halcrow 3. Decision List Recap In the last class, we determined that, when learning a t-decision list,
More informationRecognizing Safety and Liveness by Alpern and Schneider
Recognizing Safety and Liveness by Alpern and Schneider Calvin Deutschbein 17 Jan 2017 1 Intro 1.1 Safety What is safety? Bad things do not happen For example, consider the following safe program in C:
More informationRange-efficient computation of F 0 over massive data streams
Range-efficient computation of F 0 over massive data streams A. Pavan Dept. of Computer Science Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Dept. of Elec. and Computer Engg. Iowa State
More informationApproximation Algorithms and Hardness of Approximation. IPM, Jan Mohammad R. Salavatipour Department of Computing Science University of Alberta
Approximation Algorithms and Hardness of Approximation IPM, Jan 2006 Mohammad R. Salavatipour Department of Computing Science University of Alberta 1 Introduction For NP-hard optimization problems, we
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More information