Processing Aggregate Queries over Continuous Data Streams
|
|
- Ginger Francis
- 6 years ago
- Views:
Transcription
1 Processing Aggregate Queries over Continuous Data Streams Alin Dobra Computer Science Department Cornell University April 15, 2003
2 Relational Database Systems did dname 15 Legal 17 Marketing 3 Development Departments did eid ename did 10 Nicole Emma 15 5 Dan 15 2 Mike 17 Employees Join operator: did dname eid ename 15 Legal 11 Emma 15 Legal 5 Dan 17 Marketing 10 Nicole 17 Marketing 2 Mike Subset of cross product that satisfies constraint Relations contain no redundant information. Joins put information back together 2
3 Relational vs Streaming Data Relational Database Systems Relations reside on disk Access to data under complete control Queries refer to current state and posed once New Applications: network monitoring, sensor networks, telecom call records Data arrives in the form of continuous stream Order and rate or arrival not under the control of the system Queries are long running Most assumptions in relational database systems break Computation over Data Streams Frequency moments [AGM96], distinct values [FM96], norms [I00] Stream processing engines: NIAGARA, STREAM, Aurora 3
4 Processing Network Data Streams Measurement Alarms Network Operations Center Data Stream Join Query SELECT COUNT(* FROM R 1, R 2, R 3 WHERE R 1.a = R 2.b = R 3.c R 3 Telco/LAN Router R 1 R2 Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Challenge: deal with large amounts of streaming data using small memory approximate answers often suffice (e.g., trend/pattern analysis Other applications: one pass queries, sensor networks 4
5 Streaming Computational Model Synopses for R 1... Synopses for R r... Memory Stream for R 1 Stream for R 2 Stream for R r Stream Query-Processing Engine Approximate answers to queries Q 1,...,Q q Query Workload Q 1 Q q Computational model: Single Pass: each record examined at most once Fixed Order: no assumption about the arrival order can be made Small Space: log or poly-log in data stream size main memory algorithms 5
6 Class of Queries: Aggregates over Joins Equi-join J specified by relations R 1,...,R r and join constraints R 1.a 1 = R 2.a 2,... J = R 1 R r = { } t R 1 R r t. R1.a = t. 1 R2.a,... 2 R 1 a 1 a 2 R 2 Queries specified by a join and an aggregate: COUNT, SUM COUNT(R 1 R r is the number of tuples of the join Self join size: SJ(R = COUNT(R R 6
7 Example: Two-Way Join COUNT Query Problem: Estimate result of query COUNT(F a G Example: Stream F: a , frequency vector f: Stream G: a , frequency vector g: i f i i g i COUNT(F a G = fg T = [ ] = = 13 Space requirement proportional with size of f and g Multi dimensional data space can be prohibitive E.g. Three attributes, each with domains of size words 7
8 Stream Data Synopses Frequency table maintenance over streams requires too much space Summarization required Conventional data summaries fall short Quantiles and 1-d histograms [MSL98], [GK 01], [GKMS02] Cannot capture attribute correlations Little support for approximation guarantees Samples (e.g., using Reservoir Sampling Perform poorly for non foreign key joins [AGMS99] Cannot handle deletions of records Multi-dimensional histograms/wavelets Construction requires multiple passes over the data Concurrent work [TGIK02] Our approach: use sketches [AMS96],[AGMS99] 8
9 Outline of the Talk Background Introduction to sketches Log space in the size of stream and size of domain of attributes Provable probabilistic guarantees Insertions and deletions possible Sketch partitioning Sketch sharing Future work 9
10 Recall Example Problem: Estimate result of query COUNT(F a G Example: Stream F: a , frequency vector f: Stream G: a , frequency vector g: i f i i g i COUNT(F a G= fg T = [ ] = = 13 10
11 Main Idea: Basic Sketching Technique (Alon, Gibbons, Matias, Szegedy 1999 Summarize information in the stream with a single number sketch Use sketches to recover query result Sketches: Choose random ξ 1,ξ 2,ξ 3 { 1,1} Stream F: a Stream G: a X F = ξ 1 + ξ 1 + ξ 2 + ξ 3 + ξ 1 + ξ 3 X G = ξ 3 + ξ 1 + ξ 3 + ξ 1 + ξ 1 Claim X = X F X G estimates COUNT(F a G 11
12 Analysis of Basic Sketching Technique Random vectors: ξ = [ ξ 1...ξ n ] random vector of ±1 values, called projection vector Sketches: Sketch of F, X F = t F ξ t.a = i f i ξ i = fξ T Sketch of G, X G = t G ξ t.a = i g i ξ i = gξ T X = X F X G estimates COUNT(F a G since if E[ξ T ξ ] = I i 1 i 2,E[ξ i1 ξ i2 ] = 0 X is not precise if i 1 i 2 i 3 i 4, E[ξ i1 ξ i2 ξ i3 ξ i4 ] = 0 E[X] = E[fξ T ξ g T ] = fe[ξ T ξ ]g T = fig T = fg T Var(X 2ff T gg T = 2 SJ(F SJ(G ξ need not have elements fully independent. 4-wise independence suffices ξ can be efficiently generated from small seeds [ABI86] ξ vector is not stored. Required elements generated on the fly from seeds 12
13 !!,-,- ( ( % %$ Sketch Error Reduction Standard Solution Estimation of COUNT(F a G from single sketches of F and G is too noisy Solution: Average 8 ε 2 Var(X E 2 [X] E 2 [X] Var(X independent copies of X to reduce error to ε Compute median of 2log1/δ such averages to increase confidence to 1 δ Stream F Sketches of F Seeds Stream G ξ ξ Sketches of G Independent copies of X Average Median COUNT(F G(1 ± ε with prob (1 δ Memory required weakly dependent on the size of the stream: logn + 1 bits/sketch 13
14 Properties of Sketches Advantages: Space requirement logarithmic in the size of the stream and of the domain of the attributes Provable probabilistic approximation guarantees Both insertion and deletion possible For deletion tuples simply decrement the sketch instead of incrementing Simplicity Disadvantages: They apply only to the query COUNT(F a G When Var(X E 2 [X] 1 a large amount of space required 14
15 Estimation of COUNT(R 1 R r (Dobra, Garofalakis, Gehrke, Rastogi, 2002 (1 (2 ξ ξ a 1 a 2 a m ξ (m R l X l = t RL ξ (1 t.a 1...ξ (m t.a m With X = r l=1 X l can show: E[X] = COUNT(R1 Rr Var(X C r l=1 SJ(R l Assign a distinct projection vector ξ to each join constraint 15
16 Our Contributions Unless distributions in F and G are similar Var(X 1 E 2 [X] Need to produce a large number of copies of X Contributions: Two novel methods to reduce the error of estimates Sketch partitioning (Dobra, Garofkalakis, Gehrke, Rastogi, 2002 Use extra information to intelligently partition the problem The partitioning results in error reduction Sketch sharing (Dobra, Garofkalakis, Gehrke, Rastogi, 2003 Applicable when multiple queries are simultaneously answered Share sketches between queries thus saving space Multi query optimization problem 16
17 Outline of the Talk Background Introduction to sketches Sketch partitioning Use statistical information to reduce error Sketch sharing Future work 17
18 Sketch Partitioning [DGGR02] Large variance means loose estimation guarantees Building estimate with smaller variance reduces error Consider query: COUNT(F a G i f i g i E[X] = 165, E 2 [X] = Var(X 2 SJ(F SJ(G = 2( ( = Idea: Split I = {1,2,3,4} into I 1 = {1,3} and I 2 = {2,4} F I 1 F 1 X 1 I 2 F 2 X = X 1 + X 2 G I 1 G 1 I 2 G 2 E[X ] = COUNT(F 1 G 1 + COUNT(F 2 G 2 = COUNT(F G X 2 18
19 Sketch Partitioning (cont. Estimation of COUNT(F 1 G 1 i f i g i Var(X 1 2 SJ(F 1 SJ(G 1 = 2( ( = Estimation of COUNT(F 2 G 2 i f i g i Var(X 2 2 SJ(F 2 SJ(G 2 = 2( ( = Var(X = Var(X 1 + Var(X 2 = Improvement Var(X/m Var(X /(m/2 = ε Intelligently splitting the domain of the join attribute reduction in error Question: Assuming we know f and g, what is the best way to split? 19
20 Binary Sketch Partitioning Optimization Problem: Find partition I = I 1 I 2 space allocation m 1 for X 1, m 2 for X 2 s.t. m 1 + m 2 = m that minimizes where Var(X 1 m 1 + Var(X 2 m 2, Var(X k 2 fi 2 i I k i I k g 2 i. Solution: Allocate space proportional to Var(X k Transformed optimization criterion: Ψ(I 1 = fi 2 g 2 i + i I 1 i I 1 Naive solution: check all 2 I ways to partition fi 2 g 2 i i I 2 i I 2 20
21 Breiman s Theorem[BFOS84]: the problem has the property that: Binary Sketch Partitioning argmin I1,I 2 i I 1 q i Φ For q i > 0,r i > 0 and Φ(x concave, an optimum of ( i I1 q i r i i I1 q i + i I 2 q i Φ i I 1, j I 2,r i < r j ( i I2 q i r i i I2 q i Corollary: The solution of the optimization problem can be found in O( I log I by sorting i I in increasing order of r i and considering splits only in this order. Application 1: Finding the split point for discrete attribute in classification tree construction. Taking Φ(x = 2x(1 x, q i = P[X = x i ], r i = P[C = c 0 X = x i ] the criterion is gini index (impurity after split. Application 2: Find the optimal binary partitioning: Φ(x = x, q i = f i 2 i I fi 2, r i = g2 i i f 2, criterion becomes Ψ(I 1 i I f 2 i 21
22 Algorithm: Binary Sketch Partitioning Algorithm Order elements of I in increasing order of g i f i Choose the partitioning in this order that minimizes Ψ(I 1 For the optimal partitioning take memory allocation proportional to Var(X k Example: Optimal partition i g i f i Optimal memory allocation 5 : 6 Ordering within I: 1,3,2,4 Optimal partition: I 1 = {1,3}, I 2 = {2,4} Considering splits only in this order matches the intuition 22
23 Sketch Partitioning: Extensions Using Imperfect Information: Can get good partitioning without knowledge of perfect frequency tables Use rough unidimensional histograms instead time and space dependency on number of buckets instead of I provable approximation quality Generalization: K-ary split partitioning Needs generalization of Breiman s Theorem Dynamic programming algorithm Multi-way joins NP-hard With independence assumption becomes tractable decomposes into independent unidimensional problems 23
24 Experimental Study Datasets: Synthetic data sets [Vitter & Wang 99]: Frequency Rectangular regions in multi-dimensional attribute space uniformly distributed. Domain size Attribute Attribute Each region has a Zipfian distribution Number of tuples in each region Zipf distributed Total number of tuples: 10M To simulate correlations we perturbed the placement of regions Census data set ( reported in [DGGR02] Comparison: Query load: estimation using basic method JOIN-COUNT queries Error metric: relative error = 100 actual approx actual % 24
25 Join of two independent unidimensional relations on a common attribute buckets 50 buckets 100 buckets 30 Relative error (% Number of partitions Space for sketches 4000 words. 45% - 65% improvement Partitioning in two good enough; finer histograms do not help much 25
26 Chain join of three bidimentional relations on two common attributes buckets 50 buckets 100 buckets 30 Relative error (% Number of partitions Space for sketches 9000 words. 30% - 50% improvement Same trends 26
27 Outline of the Talk Background Introduction to sketches Sketch partitioning Sketch sharing Exploit commonality in the queries to reduce error Future work 27
28 Answering multiple queries simultaneously [DGGR, under review] Can apply the basic technique for each join independently COUNT(R 1 R 2 R 1 a ξ (1 a R 2 t R1 ξ (1 t.a t R2 ξ (1 t.a COUNT(R 1 R 3 R 2 R 1 R 3 R a ξ (2 a b ξ (3 a 2 t R1 ξ (2 t.a t R3 ξ (2 t.a ξ (3 t.b t R2 ξ (3 t.a If sketches are reused fewer sketches have to be maintained The total available space is divided between fewer sketches Question: How and when can sketches be reused? 28
29 Sketch Sharing R 1 t R1 ξ (1 t.a a a ξ (1 a R 2 t R2 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 ξ (1 a R 3 b ξ (2 a R 2 COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (2 t.b t R2 ξ (2 t.a Sharing the sketch of R 1 for both joins Use ξ (1 to enforce both <R 1.a,R 2.a> and <R 1.a,R 3.a> constraints Number of sketches reduced from 5 to 4 Each sketch gets 25% more space 29
30 Sketch Sharing R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (1 t.b Sketch sharing for R 1 ξ (1 on both <R 1.a,R 2.a> and <R 3.b,R 2.a> edges ξ (1 used on all edges Number of sketches reduced from 5 to 3 Each sketch gets 66% more space Simple sketch sharing algorithm 30
31 Sketch Sharing Problem: Estimate COUNT(R 1 R 2 R 3 = f R1 f T R 2 f T R 3 R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a Φ = E [ COUNT(R 1 R 3 R 2 ξ t.a ξ t.a ξ t.b ξ t.a t R 1 t R 2 t R 3 = E[f R1 ξ T ξ f T R 2 ξ T ξ f T R 3 ] = f R1 E[ξ T ξ f T R 2 ξ T ξ ]f T R 3 ] t R3 ξ (1 t.a ξ (1 t.b f R1 f T R 2 f T R 3 since E[ξ T ξ f T R 2 ξ T ξ ] f T R 2 Φ COUNT(R 1 R 2 R 3 sharing sketch for R 2 results in wrong estimation Apparent reason: ξ (1 assigned to both constraints of red join 31
32 Estimation of COUNT(R 1 R r (1 (2 ξ ξ a 1 a 2 a m ξ (m R l X l = t RL ξ (1 t.a 1...ξ (m t.a m With X = r l=1 X l can show: E[X] = COUNT(R1 Rr Var(X C r l=1 SJ(R l Assign a distinct projection vector ξ to each join constraint 32
33 Sketch Sharing: Conflicts in Sharing Schemes Correctness: within each join a different ξ -family is assigned to each join constraint. R 1 t R1 ξ (1 t.a a a ξ (1 a R 2 t R2 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 ξ (1 a R 3 b ξ (2 a R 2 No conflict. COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (2 t.b t R2 ξ (2 t.a R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a Conflict. COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (1 t.b 33
34 Sketch Sharing: Algorithm Problem: Given a set of queries to be approximated compute, the sketch sharing scheme with no conflicts that has the smaller number of sketches. Characterization: Problem is NP-hard. Heuristic Algorithm: 1. Start with no sketch sharing 2. Find pair of sketches that can be shared (if any Avoid introducing conflicts 3. Repeat Step 2 until no sharing possible 34
35 Key Observation: Space Allocation Problem Allocating identical space to each sketch might not optimize average error/max error Relative square error for Q ε 2 Q Var(X M Q E 2 [X] = W Q M Q m v memory allocation of sketch v V, M Q memory allocation for query Q Q Space allocation problem: Find m v,v V and M Q,Q Q that minimizes either Average Error: W Q, or Q Q M Q W Q Max Error: max Q Q M Q such that v V, Q Q(v : M Q m v (sketch-query constraint v V m v = m (total memory constraint 35
36 Solving the Space Allocation Problem Can integrate space allocation problems into sketch sharing algorithm Heuristic Algorithm: 1. Start with no sketch sharing 2. Find pair of sketches that can be shared (if any Avoid introducing conflicts 3. Repeat Step 2 until no sharing possible In Step 2 of algorithm solve the space allocation problem for each pair pick among the candidate pairs the one that reduces error the most Minimize max error: reduces to classic optimization problem Can be solved in O( V log V Minimize average error: complex convex optimization problem Integer version NP-hard Continuous version gives solution with good approximation efficient custom solution 36
37 Continuous Average Error Problem Optimization problem: Strictly convex optimization problem min W Q Φ(M Q Q Q v V, Q Q(v : v V m v = m M Q m v Solvable in polynomial time with general convex solvers. Slow in practice Solution satisfies the Karush-Kuhn-Tucker conditions: Q Q : W Q Φ (M Q + v V (Q µ v,q = 0 v V : Q Q(v µ v,q = λ Q Q, v J : µ v,q (m v M Q = 0 m v = M v J 37
38 Continuous Average Error Problem KKT equations not directly solvable Key idea: define equivalence relation v V and Q Q(v are equivalent if m v = M Q in optimal solution + closure Induces equivalence classes C 1,C 2,... over V Q C 2 C 4 C 1 C 3 C 7 C 5 C 6 Using KKT conditions can show: Is enough to find the equivalence classes C 1,C 2,... to get the solution Can get conditions that equivalence classes satisfy Can setup a Max-flow problem that finds divider between classes 38
39 Experiments TPC-H Frequency PART Supplier Attribute PARTKEY 1 SUPPKEY NATIONKEY Attribute PARTKEY 9 PARTSUP SUPPKEY 8 CUSTOMER PARTKEY 7 SUPPKEY 6 CUSTKEY 4 LINEITEM ORDERKEY 5 ORDERS ORDERKEY 10 Q 1 1,2 Q 9 1 Q 17 8,9 Q 25 2,7 Q 2 4,5 Q 10 6,7 Q 18 5,9 Q 26 1,6 Q 3 3,4,5 Q 11 5,8 Q 19 6,8 Q 27 3,8 Q 4 4,5,8 Q Q 20 7,8 Q 28 1,2,3 Q 5 4,5,8,9 Q 13 4 Q 21 8 Q 29 2,3,4 Q 6 2 Q 14 3 Q 22 6 Q 7 5 Q 15 3,4 Q 23 7 Q 8 9 Q 16 5,8 Q 24 2,3 W 1 = Q 1 : Q 12, W 2 = Q 1 : Q 29 39
40 Average Error for Workload W sketch sharing no sharing 25 Relative Error(% Total Memory(words Relative error reduced by 35% Number of sketches reduced from 34 to 16 40
41 Average Error for Workload W sketch sharing no sharing 40 Relative Error(% Total Memory(words Relative error reduced by 45%. Larger reduction Number of sketches reduced from 82 to 25 41
42 Summary Basic sketch technique Summarize streams by projecting them on random vectors Log space, provable estimation guarantees, deletions Require a lot of space when variance is large Contributions: Sketch partitioning Uses extra information to intelligently split domain of attributes Results in better approximation scheme Sketch sharing Exploits commonality in queries Space and computation is shared among queries Both methods lead to interesting optimization problems 42
43 Future Work: Applications Data-mining Monitoring Statistics, Correlations Aggregates over Joins Aggregates over joins are building blocks for higher level applications Useful for correlating information from multiple sources Integrate approximate query processing with data-mining and monitoring 43
44 Future Work: Distributed Computation Sketches are linear: SK(D 1 D 2 = SK(D 1 + SK(D 2 Sketches are useful for distributed computation No need to move data. Send sketches instead Computation of aggregates over joins as easy as computation of SUM Measurement Alarms Network Operations Center Data Stream Join Query SELECT COUNT(* FROM R 1, R 2, R 3 WHERE R 1.a = R 2.b = R 3.c R 3 Telco/LAN Router R 1 R2 Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Network Monitoring Peer-to-peer/sensor network Gossip-style decentralized algorithms for aggregate queries (with Kempe & Gehrke 44
45 Future work Extensions Combination with other synopses data-structures Distinct value queries Retroactive queries Improvements in the presence of extra knowledge Types of extra knowledge Schema information: foreign keys, size of relations Statistical information Workload information Find novel ways to use extra knowledge Approximate stream query processing system Integration into existing system or new system 45
46 Questions 46
Sketching Sampled Data Streams
Sketchng Sampled Data Streams Florn Rusu and Aln Dobra CISE Department Unversty of Florda March 31, 2009 Motvaton & Goal Motvaton Multcore processors How to use all the processng power? Parallel algorthms
More information14.1 Finding frequent elements in stream
Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More information1 Approximate Quantiles and Summaries
CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity
More informationCorrelated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)
Correlated subqueries Query Optimization CPS Advanced Database Systems SELECT CID FROM Course Executing correlated subquery is expensive The subquery is evaluated once for every CPS course Decorrelate!
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More informationA Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window
A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationLecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1
Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a
More informationLecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing
Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of
More informationLecture 2. Frequency problems
1 / 43 Lecture 2. Frequency problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 43 1 Frequency problems in data streams 2 Approximating inner product 3 Computing frequency moments
More informationData Streams & Communication Complexity
Data Streams & Communication Complexity Lecture 1: Simple Stream Statistics in Small Space Andrew McGregor, UMass Amherst 1/25 Data Stream Model Stream: m elements from universe of size n, e.g., x 1, x
More informationLinear Sketches A Useful Tool in Streaming and Compressive Sensing
Linear Sketches A Useful Tool in Streaming and Compressive Sensing Qin Zhang 1-1 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M =
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More information7 Algorithms for Massive Data Problems
7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random
More informationAn Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems
An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems Arnab Bhattacharyya, Palash Dey, and David P. Woodruff Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in
More informationDatabases 2011 The Relational Algebra
Databases 2011 Christian S. Jensen Computer Science, Aarhus University What is an Algebra? An algebra consists of values operators rules Closure: operations yield values Examples integers with +,, sets
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More informationCS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =
CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14 1 Probability First, recall a couple useful facts from last time about probability: Linearity of expectation: E(aX + by ) = ae(x)
More informationTopics in Probabilistic and Statistical Databases. Lecture 9: Histograms and Sampling. Dan Suciu University of Washington
Topics in Probabilistic and Statistical Databases Lecture 9: Histograms and Sampling Dan Suciu University of Washington 1 References Fast Algorithms For Hierarchical Range Histogram Construction, Guha,
More informationSchema Refinement and Normal Forms
Schema Refinement and Normal Forms Chapter 19 Database Management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational
More informationProgressive & Algorithms & Systems
University of California Merced Lawrence Berkeley National Laboratory Progressive Computation for Data Exploration Progressive Computation Online Aggregation (OLA) in DB Query Result Estimate Result ε
More informationLecture 2: Streaming Algorithms
CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 2: Streaming Algorithms Prof. Moses Chariar Scribes: Stephen Mussmann 1 Overview In this lecture, we first derive a concentration inequality
More informationCausality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks
Plausible Clocks with Bounded Inaccuracy Causality & Concurrency a b exists a path from a to b Brad Moore, Paul Sivilotti Computer Science & Engineering The Ohio State University paolo@cse.ohio-state.edu
More informationOn Two Class-Constrained Versions of the Multiple Knapsack Problem
On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic
More informationSchema Refinement and Normal Forms. Chapter 19
Schema Refinement and Normal Forms Chapter 19 1 Review: Database Design Requirements Analysis user needs; what must the database do? Conceptual Design high level descr. (often done w/er model) Logical
More informationThe Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Example (Contd.
The Evils of Redundancy Schema Refinement and Normal Forms Chapter 19 Database Management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Redundancy is at the root of several problems associated with relational
More informationEstimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan
Estimating Dominance Norms of Multiple Data Streams Graham Cormode graham@dimacs.rutgers.edu Joint work with S. Muthukrishnan Data Stream Phenomenon Data is being produced faster than our ability to process
More informationThe Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Refining an ER Diagram
Schema Refinement and Normal Forms Chapter 19 Database Management Systems, R. Ramakrishnan and J. Gehrke 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational
More informationData Stream Methods. Graham Cormode S. Muthukrishnan
Data Stream Methods Graham Cormode graham@dimacs.rutgers.edu S. Muthukrishnan muthu@cs.rutgers.edu Plan of attack Frequent Items / Heavy Hitters Counting Distinct Elements Clustering items in Streams Motivating
More informationSchema Refinement and Normal Forms Chapter 19
Schema Refinement and Normal Forms Chapter 19 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh Database Management
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationFactorized Relational Databases Olteanu and Závodný, University of Oxford
November 8, 2013 Database Seminar, U Washington Factorized Relational Databases http://www.cs.ox.ac.uk/projects/fd/ Olteanu and Závodný, University of Oxford Factorized Representations of Relations Cust
More informationThe Count-Min-Sketch and its Applications
The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local
More informationIntroduction. Normalization. Example. Redundancy. What problems are caused by redundancy? What are functional dependencies?
Normalization Introduction What problems are caused by redundancy? UVic C SC 370 Dr. Daniel M. German Department of Computer Science What are functional dependencies? What are normal forms? What are the
More informationLecture 10 September 27, 2016
CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 10 September 27, 2016 Scribes: Quinten McNamara & William Hoza 1 Overview In this lecture, we focus on constructing coresets, which are
More informationEnd-biased Samples for Join Cardinality Estimation
End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton Computer Sciences Department University of Wisconsin-Madison estan,naughton@cs.wisc.edu Abstract We present a new
More informationCSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18
CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$
More informationSubmodularity in Machine Learning
Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity
More informationSketching and Streaming for Distributions
Sketching and Streaming for Distributions Piotr Indyk Andrew McGregor Massachusetts Institute of Technology University of California, San Diego Main Material: Stable distributions, pseudo-random generators,
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More informationBias Correction in Classification Tree Construction ICML 2001
Bias Correction in Classification Tree Construction ICML 21 Alin Dobra Johannes Gehrke Department of Computer Science Cornell University December 15, 21 Classification Tree Construction Outlook Temp. Humidity
More informationEffective computation of biased quantiles over data streams
Effective computation of biased quantiles over data streams Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com
More informationBiased Quantiles. Flip Korn Graham Cormode S. Muthukrishnan
Biased Quantiles Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com Quantiles Quantiles summarize data
More informationCS 188: Artificial Intelligence Spring Announcements
CS 188: Artificial Intelligence Spring 2011 Lecture 16: Bayes Nets IV Inference 3/28/2011 Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew Moore Announcements
More informationapproximation algorithms I
SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions
More informationHow to Optimally Allocate Resources for Coded Distributed Computing?
1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern
More informationHomework 3. Convex Optimization /36-725
Homework 3 Convex Optimization 10-725/36-725 Due Friday October 14 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)
More informationThe Evils of Redundancy. Schema Refinement and Normalization. Functional Dependencies (FDs) Example: Constraints on Entity Set. Refining an ER Diagram
The Evils of Redundancy Schema Refinement and Normalization Chapter 1 Nobody realizes that some people expend tremendous energy merely to be normal. Albert Camus Redundancy is at the root of several problems
More informationCORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES
CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES By FEI XU A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More information11 Heavy Hitters Streaming Majority
11 Heavy Hitters A core mining problem is to find items that occur more than one would expect. These may be called outliers, anomalies, or other terms. Statistical models can be layered on top of or underneath
More information[Title removed for anonymity]
[Title removed for anonymity] Graham Cormode graham@research.att.com Magda Procopiuc(AT&T) Divesh Srivastava(AT&T) Thanh Tran (UMass Amherst) 1 Introduction Privacy is a common theme in public discourse
More informationB669 Sublinear Algorithms for Big Data
B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store
More informationAnalysis of Algorithms I: Perfect Hashing
Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the
More information0 1 d 010. h 0111 i g 01100
CMSC 5:Fall 07 Dave Mount Solutions to Homework : Greedy Algorithms Solution : The code tree is shown in Fig.. a 0000 c 0000 b 00 f 000 d 00 g 000 h 0 i 00 e Prob / / /8 /6 / Depth 5 Figure : Solution
More informationSummarizing Measured Data
Summarizing Measured Data 12-1 Overview Basic Probability and Statistics Concepts: CDF, PDF, PMF, Mean, Variance, CoV, Normal Distribution Summarizing Data by a Single Number: Mean, Median, and Mode, Arithmetic,
More informationStreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory
StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,
More informationLinear & nonlinear classifiers
Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table
More informationP P P NP-Hard: L is NP-hard if for all L NP, L L. Thus, if we could solve L in polynomial. Cook's Theorem and Reductions
Summary of the previous lecture Recall that we mentioned the following topics: P: is the set of decision problems (or languages) that are solvable in polynomial time. NP: is the set of decision problems
More informationQuiz 2. Due November 26th, CS525 - Advanced Database Organization Solutions
Name CWID Quiz 2 Due November 26th, 2015 CS525 - Advanced Database Organization s Please leave this empty! 1 2 3 4 5 6 7 Sum Instructions Multiple choice questions are graded in the following way: You
More informationInteger Linear Programs
Lecture 2: Review, Linear Programming Relaxations Today we will talk about expressing combinatorial problems as mathematical programs, specifically Integer Linear Programs (ILPs). We then see what happens
More informationLecture #7 (Relational Design Theory, cont d.)
Introduction to Data Management Lecture #7 (Relational Design Theory, cont d.) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Announcements
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More informationLecture 3 Sept. 4, 2014
CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.
More informationA Pass-Efficient Algorithm for Clustering Census Data
A Pass-Efficient Algorithm for Clustering Census Data Kevin Chang Yale University Ravi Kannan Yale University Abstract We present a number of streaming algorithms for a basic clustering problem for massive
More informationCS246 Final Exam, Winter 2011
CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including
More informationLecture 01 August 31, 2017
Sketching Algorithms for Big Data Fall 2017 Prof. Jelani Nelson Lecture 01 August 31, 2017 Scribe: Vinh-Kha Le 1 Overview In this lecture, we overviewed the six main topics covered in the course, reviewed
More informationLecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.
CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the
More informationSTATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS
STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS By LIXIA CHEN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS
More informationLecture 5: Hashing. David Woodruff Carnegie Mellon University
Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of
More informationOn the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation
On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation Mikael Fallgren Royal Institute of Technology December, 2009 Abstract
More informationRandom Sampling on Big Data: Techniques and Applications Ke Yi
: Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk Big Data in one slide The 3 V s: Volume Velocity Variety Integers, real numbers Points in a multi-dimensional
More informationTime-Decayed Correlated Aggregates over Data Streams
Time-Decayed Correlated Aggregates over Data Streams Graham Cormode AT&T Labs Research graham@research.att.com Srikanta Tirthapura Bojian Xu ECE Dept., Iowa State University {snt,bojianxu}@iastate.edu
More informationTitle: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,
Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Coventry, UK Keywords: streaming algorithms; frequent items; approximate counting, sketch
More informationStephen Scott.
1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training
More informationData Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395
Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection
More informationLecture 2 Sept. 8, 2015
CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],
More informationWavelet decomposition of data streams. by Dragana Veljkovic
Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records
More informationSchema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) [R&G] Chapter 19
Schema Refinement and Normal Forms [R&G] Chapter 19 CS432 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational schemas: redundant storage, insert/delete/update
More informationLinear Congruences. The equation ax = b for a, b R is uniquely solvable if a 0: x = b/a. Want to extend to the linear congruence:
Linear Congruences The equation ax = b for a, b R is uniquely solvable if a 0: x = b/a. Want to extend to the linear congruence: ax b (mod m), a, b Z, m N +. (1) If x 0 is a solution then so is x k :=
More informationAnnouncements. CS 188: Artificial Intelligence Spring Bayes Net Semantics. Probabilities in BNs. All Conditional Independences
CS 188: Artificial Intelligence Spring 2011 Announcements Assignments W4 out today --- this is your last written!! Any assignments you have not picked up yet In bin in 283 Soda [same room as for submission
More informationImpression Store: Compressive Sensing-based Storage for. Big Data Analytics
Impression Store: Compressive Sensing-based Storage for Big Data Analytics Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang Microsoft Research The Curse of O(N) in
More informationCS570 Introduction to Data Mining
CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning
More informationChapter 7: Relational Database Design
Chapter 7: Relational Database Design Chapter 7: Relational Database Design! First Normal Form! Pitfalls in Relational Database Design! Functional Dependencies! Decomposition! Boyce-Codd Normal Form! Third
More informationApproximate Query Processing Using Wavelets
Approximate Query Processing Using Wavelets Kaushik Chakrabarti Minos Garofalakis Rajeev Rastogi Kyuseok Shim Presented by Guanghua Yan Outline Approximate query processing: Problem and Prior solutions
More informationThe Evils of Redundancy. Schema Refinement and Normal Forms. Functional Dependencies (FDs) Example: Constraints on Entity Set. Example (Contd.
The Evils of Redundancy Schema Refinement and Normal Forms INFO 330, Fall 2006 1 Redundancy is at the root of several problems associated with relational schemas: redundant storage, insert/delete/update
More information12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria
12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos
More informationPrivacy-Preserving Data Mining
CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,
More informationSequential: Vector of Bits
Counting the Number of Accesses Sequential: Vector of Bits When estimating seek costs, we need to calculate the probability distribution for the distance between two subsequent qualifying cylinders. We
More informationCMPUT651: Differential Privacy
CMPUT65: Differential Privacy Homework assignment # 2 Due date: Apr. 3rd, 208 Discussion and the exchange of ideas are essential to doing academic work. For assignments in this course, you are encouraged
More informationBayes Nets III: Inference
1 Hal Daumé III (me@hal3.name) Bayes Nets III: Inference Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 10 Apr 2012 Many slides courtesy
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 4: Query Optimization Query Optimization Cost estimation Strategies for exploring plans Q min CS 347 Notes 4 2 Cost Estimation Based on
More informationSchema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) CIS 330, Spring 2004 Lecture 11 March 2, 2004
Schema Refinement and Normal Forms CIS 330, Spring 2004 Lecture 11 March 2, 2004 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational schemas: redundant storage,
More informationMarch 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.
Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)
More informationOn the Optimality of the Greedy Heuristic in Wavelet Synopses for Range Queries
On the Optimality of the Greedy Heuristic in Wavelet Synopses for Range Queries Yossi Matias School of Computer Science Tel Aviv University Tel Aviv 69978, Israel matias@tau.ac.il Daniel Urieli School
More informationBehavioral Simulations in MapReduce
Behavioral Simulations in MapReduce Guozhang Wang, Marcos Vaz Salles, Benjamin Sowell, Xun Wang, Tuan Cao, Alan Demers, Johannes Gehrke, Walker White Cornell University 1 What are Behavioral Simulations?
More informationData Analytics Beyond OLAP. Prof. Yanlei Diao
Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of
More informationConstraints: Functional Dependencies
Constraints: Functional Dependencies Spring 2018 School of Computer Science University of Waterloo Databases CS348 (University of Waterloo) Functional Dependencies 1 / 32 Schema Design When we get a relational
More informationRandomized Load Balancing:The Power of 2 Choices
Randomized Load Balancing: The Power of 2 Choices June 3, 2010 Balls and Bins Problem We have m balls that are thrown into n bins, the location of each ball chosen independently and uniformly at random
More information