Processing Aggregate Queries over Continuous Data Streams

Size: px
Start display at page:

Download "Processing Aggregate Queries over Continuous Data Streams"

Transcription

1 Processing Aggregate Queries over Continuous Data Streams Alin Dobra Computer Science Department Cornell University April 15, 2003

2 Relational Database Systems did dname 15 Legal 17 Marketing 3 Development Departments did eid ename did 10 Nicole Emma 15 5 Dan 15 2 Mike 17 Employees Join operator: did dname eid ename 15 Legal 11 Emma 15 Legal 5 Dan 17 Marketing 10 Nicole 17 Marketing 2 Mike Subset of cross product that satisfies constraint Relations contain no redundant information. Joins put information back together 2

3 Relational vs Streaming Data Relational Database Systems Relations reside on disk Access to data under complete control Queries refer to current state and posed once New Applications: network monitoring, sensor networks, telecom call records Data arrives in the form of continuous stream Order and rate or arrival not under the control of the system Queries are long running Most assumptions in relational database systems break Computation over Data Streams Frequency moments [AGM96], distinct values [FM96], norms [I00] Stream processing engines: NIAGARA, STREAM, Aurora 3

4 Processing Network Data Streams Measurement Alarms Network Operations Center Data Stream Join Query SELECT COUNT(* FROM R 1, R 2, R 3 WHERE R 1.a = R 2.b = R 3.c R 3 Telco/LAN Router R 1 R2 Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Challenge: deal with large amounts of streaming data using small memory approximate answers often suffice (e.g., trend/pattern analysis Other applications: one pass queries, sensor networks 4

5 Streaming Computational Model Synopses for R 1... Synopses for R r... Memory Stream for R 1 Stream for R 2 Stream for R r Stream Query-Processing Engine Approximate answers to queries Q 1,...,Q q Query Workload Q 1 Q q Computational model: Single Pass: each record examined at most once Fixed Order: no assumption about the arrival order can be made Small Space: log or poly-log in data stream size main memory algorithms 5

6 Class of Queries: Aggregates over Joins Equi-join J specified by relations R 1,...,R r and join constraints R 1.a 1 = R 2.a 2,... J = R 1 R r = { } t R 1 R r t. R1.a = t. 1 R2.a,... 2 R 1 a 1 a 2 R 2 Queries specified by a join and an aggregate: COUNT, SUM COUNT(R 1 R r is the number of tuples of the join Self join size: SJ(R = COUNT(R R 6

7 Example: Two-Way Join COUNT Query Problem: Estimate result of query COUNT(F a G Example: Stream F: a , frequency vector f: Stream G: a , frequency vector g: i f i i g i COUNT(F a G = fg T = [ ] = = 13 Space requirement proportional with size of f and g Multi dimensional data space can be prohibitive E.g. Three attributes, each with domains of size words 7

8 Stream Data Synopses Frequency table maintenance over streams requires too much space Summarization required Conventional data summaries fall short Quantiles and 1-d histograms [MSL98], [GK 01], [GKMS02] Cannot capture attribute correlations Little support for approximation guarantees Samples (e.g., using Reservoir Sampling Perform poorly for non foreign key joins [AGMS99] Cannot handle deletions of records Multi-dimensional histograms/wavelets Construction requires multiple passes over the data Concurrent work [TGIK02] Our approach: use sketches [AMS96],[AGMS99] 8

9 Outline of the Talk Background Introduction to sketches Log space in the size of stream and size of domain of attributes Provable probabilistic guarantees Insertions and deletions possible Sketch partitioning Sketch sharing Future work 9

10 Recall Example Problem: Estimate result of query COUNT(F a G Example: Stream F: a , frequency vector f: Stream G: a , frequency vector g: i f i i g i COUNT(F a G= fg T = [ ] = = 13 10

11 Main Idea: Basic Sketching Technique (Alon, Gibbons, Matias, Szegedy 1999 Summarize information in the stream with a single number sketch Use sketches to recover query result Sketches: Choose random ξ 1,ξ 2,ξ 3 { 1,1} Stream F: a Stream G: a X F = ξ 1 + ξ 1 + ξ 2 + ξ 3 + ξ 1 + ξ 3 X G = ξ 3 + ξ 1 + ξ 3 + ξ 1 + ξ 1 Claim X = X F X G estimates COUNT(F a G 11

12 Analysis of Basic Sketching Technique Random vectors: ξ = [ ξ 1...ξ n ] random vector of ±1 values, called projection vector Sketches: Sketch of F, X F = t F ξ t.a = i f i ξ i = fξ T Sketch of G, X G = t G ξ t.a = i g i ξ i = gξ T X = X F X G estimates COUNT(F a G since if E[ξ T ξ ] = I i 1 i 2,E[ξ i1 ξ i2 ] = 0 X is not precise if i 1 i 2 i 3 i 4, E[ξ i1 ξ i2 ξ i3 ξ i4 ] = 0 E[X] = E[fξ T ξ g T ] = fe[ξ T ξ ]g T = fig T = fg T Var(X 2ff T gg T = 2 SJ(F SJ(G ξ need not have elements fully independent. 4-wise independence suffices ξ can be efficiently generated from small seeds [ABI86] ξ vector is not stored. Required elements generated on the fly from seeds 12

13 !!,-,- ( ( % %$ Sketch Error Reduction Standard Solution Estimation of COUNT(F a G from single sketches of F and G is too noisy Solution: Average 8 ε 2 Var(X E 2 [X] E 2 [X] Var(X independent copies of X to reduce error to ε Compute median of 2log1/δ such averages to increase confidence to 1 δ Stream F Sketches of F Seeds Stream G ξ ξ Sketches of G Independent copies of X Average Median COUNT(F G(1 ± ε with prob (1 δ Memory required weakly dependent on the size of the stream: logn + 1 bits/sketch 13

14 Properties of Sketches Advantages: Space requirement logarithmic in the size of the stream and of the domain of the attributes Provable probabilistic approximation guarantees Both insertion and deletion possible For deletion tuples simply decrement the sketch instead of incrementing Simplicity Disadvantages: They apply only to the query COUNT(F a G When Var(X E 2 [X] 1 a large amount of space required 14

15 Estimation of COUNT(R 1 R r (Dobra, Garofalakis, Gehrke, Rastogi, 2002 (1 (2 ξ ξ a 1 a 2 a m ξ (m R l X l = t RL ξ (1 t.a 1...ξ (m t.a m With X = r l=1 X l can show: E[X] = COUNT(R1 Rr Var(X C r l=1 SJ(R l Assign a distinct projection vector ξ to each join constraint 15

16 Our Contributions Unless distributions in F and G are similar Var(X 1 E 2 [X] Need to produce a large number of copies of X Contributions: Two novel methods to reduce the error of estimates Sketch partitioning (Dobra, Garofkalakis, Gehrke, Rastogi, 2002 Use extra information to intelligently partition the problem The partitioning results in error reduction Sketch sharing (Dobra, Garofkalakis, Gehrke, Rastogi, 2003 Applicable when multiple queries are simultaneously answered Share sketches between queries thus saving space Multi query optimization problem 16

17 Outline of the Talk Background Introduction to sketches Sketch partitioning Use statistical information to reduce error Sketch sharing Future work 17

18 Sketch Partitioning [DGGR02] Large variance means loose estimation guarantees Building estimate with smaller variance reduces error Consider query: COUNT(F a G i f i g i E[X] = 165, E 2 [X] = Var(X 2 SJ(F SJ(G = 2( ( = Idea: Split I = {1,2,3,4} into I 1 = {1,3} and I 2 = {2,4} F I 1 F 1 X 1 I 2 F 2 X = X 1 + X 2 G I 1 G 1 I 2 G 2 E[X ] = COUNT(F 1 G 1 + COUNT(F 2 G 2 = COUNT(F G X 2 18

19 Sketch Partitioning (cont. Estimation of COUNT(F 1 G 1 i f i g i Var(X 1 2 SJ(F 1 SJ(G 1 = 2( ( = Estimation of COUNT(F 2 G 2 i f i g i Var(X 2 2 SJ(F 2 SJ(G 2 = 2( ( = Var(X = Var(X 1 + Var(X 2 = Improvement Var(X/m Var(X /(m/2 = ε Intelligently splitting the domain of the join attribute reduction in error Question: Assuming we know f and g, what is the best way to split? 19

20 Binary Sketch Partitioning Optimization Problem: Find partition I = I 1 I 2 space allocation m 1 for X 1, m 2 for X 2 s.t. m 1 + m 2 = m that minimizes where Var(X 1 m 1 + Var(X 2 m 2, Var(X k 2 fi 2 i I k i I k g 2 i. Solution: Allocate space proportional to Var(X k Transformed optimization criterion: Ψ(I 1 = fi 2 g 2 i + i I 1 i I 1 Naive solution: check all 2 I ways to partition fi 2 g 2 i i I 2 i I 2 20

21 Breiman s Theorem[BFOS84]: the problem has the property that: Binary Sketch Partitioning argmin I1,I 2 i I 1 q i Φ For q i > 0,r i > 0 and Φ(x concave, an optimum of ( i I1 q i r i i I1 q i + i I 2 q i Φ i I 1, j I 2,r i < r j ( i I2 q i r i i I2 q i Corollary: The solution of the optimization problem can be found in O( I log I by sorting i I in increasing order of r i and considering splits only in this order. Application 1: Finding the split point for discrete attribute in classification tree construction. Taking Φ(x = 2x(1 x, q i = P[X = x i ], r i = P[C = c 0 X = x i ] the criterion is gini index (impurity after split. Application 2: Find the optimal binary partitioning: Φ(x = x, q i = f i 2 i I fi 2, r i = g2 i i f 2, criterion becomes Ψ(I 1 i I f 2 i 21

22 Algorithm: Binary Sketch Partitioning Algorithm Order elements of I in increasing order of g i f i Choose the partitioning in this order that minimizes Ψ(I 1 For the optimal partitioning take memory allocation proportional to Var(X k Example: Optimal partition i g i f i Optimal memory allocation 5 : 6 Ordering within I: 1,3,2,4 Optimal partition: I 1 = {1,3}, I 2 = {2,4} Considering splits only in this order matches the intuition 22

23 Sketch Partitioning: Extensions Using Imperfect Information: Can get good partitioning without knowledge of perfect frequency tables Use rough unidimensional histograms instead time and space dependency on number of buckets instead of I provable approximation quality Generalization: K-ary split partitioning Needs generalization of Breiman s Theorem Dynamic programming algorithm Multi-way joins NP-hard With independence assumption becomes tractable decomposes into independent unidimensional problems 23

24 Experimental Study Datasets: Synthetic data sets [Vitter & Wang 99]: Frequency Rectangular regions in multi-dimensional attribute space uniformly distributed. Domain size Attribute Attribute Each region has a Zipfian distribution Number of tuples in each region Zipf distributed Total number of tuples: 10M To simulate correlations we perturbed the placement of regions Census data set ( reported in [DGGR02] Comparison: Query load: estimation using basic method JOIN-COUNT queries Error metric: relative error = 100 actual approx actual % 24

25 Join of two independent unidimensional relations on a common attribute buckets 50 buckets 100 buckets 30 Relative error (% Number of partitions Space for sketches 4000 words. 45% - 65% improvement Partitioning in two good enough; finer histograms do not help much 25

26 Chain join of three bidimentional relations on two common attributes buckets 50 buckets 100 buckets 30 Relative error (% Number of partitions Space for sketches 9000 words. 30% - 50% improvement Same trends 26

27 Outline of the Talk Background Introduction to sketches Sketch partitioning Sketch sharing Exploit commonality in the queries to reduce error Future work 27

28 Answering multiple queries simultaneously [DGGR, under review] Can apply the basic technique for each join independently COUNT(R 1 R 2 R 1 a ξ (1 a R 2 t R1 ξ (1 t.a t R2 ξ (1 t.a COUNT(R 1 R 3 R 2 R 1 R 3 R a ξ (2 a b ξ (3 a 2 t R1 ξ (2 t.a t R3 ξ (2 t.a ξ (3 t.b t R2 ξ (3 t.a If sketches are reused fewer sketches have to be maintained The total available space is divided between fewer sketches Question: How and when can sketches be reused? 28

29 Sketch Sharing R 1 t R1 ξ (1 t.a a a ξ (1 a R 2 t R2 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 ξ (1 a R 3 b ξ (2 a R 2 COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (2 t.b t R2 ξ (2 t.a Sharing the sketch of R 1 for both joins Use ξ (1 to enforce both <R 1.a,R 2.a> and <R 1.a,R 3.a> constraints Number of sketches reduced from 5 to 4 Each sketch gets 25% more space 29

30 Sketch Sharing R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (1 t.b Sketch sharing for R 1 ξ (1 on both <R 1.a,R 2.a> and <R 3.b,R 2.a> edges ξ (1 used on all edges Number of sketches reduced from 5 to 3 Each sketch gets 66% more space Simple sketch sharing algorithm 30

31 Sketch Sharing Problem: Estimate COUNT(R 1 R 2 R 3 = f R1 f T R 2 f T R 3 R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a Φ = E [ COUNT(R 1 R 3 R 2 ξ t.a ξ t.a ξ t.b ξ t.a t R 1 t R 2 t R 3 = E[f R1 ξ T ξ f T R 2 ξ T ξ f T R 3 ] = f R1 E[ξ T ξ f T R 2 ξ T ξ ]f T R 3 ] t R3 ξ (1 t.a ξ (1 t.b f R1 f T R 2 f T R 3 since E[ξ T ξ f T R 2 ξ T ξ ] f T R 2 Φ COUNT(R 1 R 2 R 3 sharing sketch for R 2 results in wrong estimation Apparent reason: ξ (1 assigned to both constraints of red join 31

32 Estimation of COUNT(R 1 R r (1 (2 ξ ξ a 1 a 2 a m ξ (m R l X l = t RL ξ (1 t.a 1...ξ (m t.a m With X = r l=1 X l can show: E[X] = COUNT(R1 Rr Var(X C r l=1 SJ(R l Assign a distinct projection vector ξ to each join constraint 32

33 Sketch Sharing: Conflicts in Sharing Schemes Correctness: within each join a different ξ -family is assigned to each join constraint. R 1 t R1 ξ (1 t.a a a ξ (1 a R 2 t R2 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 ξ (1 a R 3 b ξ (2 a R 2 No conflict. COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (2 t.b t R2 ξ (2 t.a R 1 t R1 ξ (1 t.a COUNT(R 1 R 2 COUNT(R 1 R 3 R 2 a ξ (1 a a a ξ (1 a R 3 b ξ (1 R 2 t R2 ξ (1 t.a Conflict. COUNT(R 1 R 3 R 2 t R3 ξ (1 t.a ξ (1 t.b 33

34 Sketch Sharing: Algorithm Problem: Given a set of queries to be approximated compute, the sketch sharing scheme with no conflicts that has the smaller number of sketches. Characterization: Problem is NP-hard. Heuristic Algorithm: 1. Start with no sketch sharing 2. Find pair of sketches that can be shared (if any Avoid introducing conflicts 3. Repeat Step 2 until no sharing possible 34

35 Key Observation: Space Allocation Problem Allocating identical space to each sketch might not optimize average error/max error Relative square error for Q ε 2 Q Var(X M Q E 2 [X] = W Q M Q m v memory allocation of sketch v V, M Q memory allocation for query Q Q Space allocation problem: Find m v,v V and M Q,Q Q that minimizes either Average Error: W Q, or Q Q M Q W Q Max Error: max Q Q M Q such that v V, Q Q(v : M Q m v (sketch-query constraint v V m v = m (total memory constraint 35

36 Solving the Space Allocation Problem Can integrate space allocation problems into sketch sharing algorithm Heuristic Algorithm: 1. Start with no sketch sharing 2. Find pair of sketches that can be shared (if any Avoid introducing conflicts 3. Repeat Step 2 until no sharing possible In Step 2 of algorithm solve the space allocation problem for each pair pick among the candidate pairs the one that reduces error the most Minimize max error: reduces to classic optimization problem Can be solved in O( V log V Minimize average error: complex convex optimization problem Integer version NP-hard Continuous version gives solution with good approximation efficient custom solution 36

37 Continuous Average Error Problem Optimization problem: Strictly convex optimization problem min W Q Φ(M Q Q Q v V, Q Q(v : v V m v = m M Q m v Solvable in polynomial time with general convex solvers. Slow in practice Solution satisfies the Karush-Kuhn-Tucker conditions: Q Q : W Q Φ (M Q + v V (Q µ v,q = 0 v V : Q Q(v µ v,q = λ Q Q, v J : µ v,q (m v M Q = 0 m v = M v J 37

38 Continuous Average Error Problem KKT equations not directly solvable Key idea: define equivalence relation v V and Q Q(v are equivalent if m v = M Q in optimal solution + closure Induces equivalence classes C 1,C 2,... over V Q C 2 C 4 C 1 C 3 C 7 C 5 C 6 Using KKT conditions can show: Is enough to find the equivalence classes C 1,C 2,... to get the solution Can get conditions that equivalence classes satisfy Can setup a Max-flow problem that finds divider between classes 38

39 Experiments TPC-H Frequency PART Supplier Attribute PARTKEY 1 SUPPKEY NATIONKEY Attribute PARTKEY 9 PARTSUP SUPPKEY 8 CUSTOMER PARTKEY 7 SUPPKEY 6 CUSTKEY 4 LINEITEM ORDERKEY 5 ORDERS ORDERKEY 10 Q 1 1,2 Q 9 1 Q 17 8,9 Q 25 2,7 Q 2 4,5 Q 10 6,7 Q 18 5,9 Q 26 1,6 Q 3 3,4,5 Q 11 5,8 Q 19 6,8 Q 27 3,8 Q 4 4,5,8 Q Q 20 7,8 Q 28 1,2,3 Q 5 4,5,8,9 Q 13 4 Q 21 8 Q 29 2,3,4 Q 6 2 Q 14 3 Q 22 6 Q 7 5 Q 15 3,4 Q 23 7 Q 8 9 Q 16 5,8 Q 24 2,3 W 1 = Q 1 : Q 12, W 2 = Q 1 : Q 29 39

40 Average Error for Workload W sketch sharing no sharing 25 Relative Error(% Total Memory(words Relative error reduced by 35% Number of sketches reduced from 34 to 16 40

41 Average Error for Workload W sketch sharing no sharing 40 Relative Error(% Total Memory(words Relative error reduced by 45%. Larger reduction Number of sketches reduced from 82 to 25 41

42 Summary Basic sketch technique Summarize streams by projecting them on random vectors Log space, provable estimation guarantees, deletions Require a lot of space when variance is large Contributions: Sketch partitioning Uses extra information to intelligently split domain of attributes Results in better approximation scheme Sketch sharing Exploits commonality in queries Space and computation is shared among queries Both methods lead to interesting optimization problems 42

43 Future Work: Applications Data-mining Monitoring Statistics, Correlations Aggregates over Joins Aggregates over joins are building blocks for higher level applications Useful for correlating information from multiple sources Integrate approximate query processing with data-mining and monitoring 43

44 Future Work: Distributed Computation Sketches are linear: SK(D 1 D 2 = SK(D 1 + SK(D 2 Sketches are useful for distributed computation No need to move data. Send sketches instead Computation of aggregates over joins as easy as computation of SUM Measurement Alarms Network Operations Center Data Stream Join Query SELECT COUNT(* FROM R 1, R 2, R 3 WHERE R 1.a = R 2.b = R 3.c R 3 Telco/LAN Router R 1 R2 Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Telco/LAN Router Network Monitoring Peer-to-peer/sensor network Gossip-style decentralized algorithms for aggregate queries (with Kempe & Gehrke 44

45 Future work Extensions Combination with other synopses data-structures Distinct value queries Retroactive queries Improvements in the presence of extra knowledge Types of extra knowledge Schema information: foreign keys, size of relations Statistical information Workload information Find novel ways to use extra knowledge Approximate stream query processing system Integration into existing system or new system 45

46 Questions 46

Sketching Sampled Data Streams

Sketching Sampled Data Streams Sketchng Sampled Data Streams Florn Rusu and Aln Dobra CISE Department Unversty of Florda March 31, 2009 Motvaton & Goal Motvaton Multcore processors How to use all the processng power? Parallel algorthms

More information

14.1 Finding frequent elements in stream

14.1 Finding frequent elements in stream Chapter 14 Streaming Data Model 14.1 Finding frequent elements in stream A very useful statistics for many applications is to keep track of elements that occur more frequently. It can come in many flavours

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1)

Correlated subqueries. Query Optimization. Magic decorrelation. COUNT bug. Magic example (slide 2) Magic example (slide 1) Correlated subqueries Query Optimization CPS Advanced Database Systems SELECT CID FROM Course Executing correlated subquery is expensive The subquery is evaluated once for every CPS course Decorrelate!

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a

More information

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of

More information

Lecture 2. Frequency problems

Lecture 2. Frequency problems 1 / 43 Lecture 2. Frequency problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 43 1 Frequency problems in data streams 2 Approximating inner product 3 Computing frequency moments

More information

Data Streams & Communication Complexity

Data Streams & Communication Complexity Data Streams & Communication Complexity Lecture 1: Simple Stream Statistics in Small Space Andrew McGregor, UMass Amherst 1/25 Data Stream Model Stream: m elements from universe of size n, e.g., x 1, x

More information

Linear Sketches A Useful Tool in Streaming and Compressive Sensing

Linear Sketches A Useful Tool in Streaming and Compressive Sensing Linear Sketches A Useful Tool in Streaming and Compressive Sensing Qin Zhang 1-1 Linear sketch Random linear projection M : R n R k that preserves properties of any v R n with high prob. where k n. M =

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

7 Algorithms for Massive Data Problems

7 Algorithms for Massive Data Problems 7 Algorithms for Massive Data Problems Massive Data, Sampling This chapter deals with massive data problems where the input data (a graph, a matrix or some other object) is too large to be stored in random

More information

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems Arnab Bhattacharyya, Palash Dey, and David P. Woodruff Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in

More information

Databases 2011 The Relational Algebra

Databases 2011 The Relational Algebra Databases 2011 Christian S. Jensen Computer Science, Aarhus University What is an Algebra? An algebra consists of values operators rules Closure: operations yield values Examples integers with +,, sets

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) =

CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14. For random numbers X which only take on nonnegative integer values, E(X) = CS 125 Section #12 (More) Probability and Randomized Algorithms 11/24/14 1 Probability First, recall a couple useful facts from last time about probability: Linearity of expectation: E(aX + by ) = ae(x)

More information

Topics in Probabilistic and Statistical Databases. Lecture 9: Histograms and Sampling. Dan Suciu University of Washington

Topics in Probabilistic and Statistical Databases. Lecture 9: Histograms and Sampling. Dan Suciu University of Washington Topics in Probabilistic and Statistical Databases Lecture 9: Histograms and Sampling Dan Suciu University of Washington 1 References Fast Algorithms For Hierarchical Range Histogram Construction, Guha,

More information

Schema Refinement and Normal Forms

Schema Refinement and Normal Forms Schema Refinement and Normal Forms Chapter 19 Database Management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational

More information

Progressive & Algorithms & Systems

Progressive & Algorithms & Systems University of California Merced Lawrence Berkeley National Laboratory Progressive Computation for Data Exploration Progressive Computation Online Aggregation (OLA) in DB Query Result Estimate Result ε

More information

Lecture 2: Streaming Algorithms

Lecture 2: Streaming Algorithms CS369G: Algorithmic Techniques for Big Data Spring 2015-2016 Lecture 2: Streaming Algorithms Prof. Moses Chariar Scribes: Stephen Mussmann 1 Overview In this lecture, we first derive a concentration inequality

More information

Causality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks

Causality & Concurrency. Time-Stamping Systems. Plausibility. Example TSS: Lamport Clocks. Example TSS: Vector Clocks Plausible Clocks with Bounded Inaccuracy Causality & Concurrency a b exists a path from a to b Brad Moore, Paul Sivilotti Computer Science & Engineering The Ohio State University paolo@cse.ohio-state.edu

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

Schema Refinement and Normal Forms. Chapter 19

Schema Refinement and Normal Forms. Chapter 19 Schema Refinement and Normal Forms Chapter 19 1 Review: Database Design Requirements Analysis user needs; what must the database do? Conceptual Design high level descr. (often done w/er model) Logical

More information

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Example (Contd.

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Example (Contd. The Evils of Redundancy Schema Refinement and Normal Forms Chapter 19 Database Management Systems, 3ed, R. Ramakrishnan and J. Gehrke 1 Redundancy is at the root of several problems associated with relational

More information

Estimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan

Estimating Dominance Norms of Multiple Data Streams Graham Cormode Joint work with S. Muthukrishnan Estimating Dominance Norms of Multiple Data Streams Graham Cormode graham@dimacs.rutgers.edu Joint work with S. Muthukrishnan Data Stream Phenomenon Data is being produced faster than our ability to process

More information

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Refining an ER Diagram

The Evils of Redundancy. Schema Refinement and Normal Forms. Example: Constraints on Entity Set. Functional Dependencies (FDs) Refining an ER Diagram Schema Refinement and Normal Forms Chapter 19 Database Management Systems, R. Ramakrishnan and J. Gehrke 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational

More information

Data Stream Methods. Graham Cormode S. Muthukrishnan

Data Stream Methods. Graham Cormode S. Muthukrishnan Data Stream Methods Graham Cormode graham@dimacs.rutgers.edu S. Muthukrishnan muthu@cs.rutgers.edu Plan of attack Frequent Items / Heavy Hitters Counting Distinct Elements Clustering items in Streams Motivating

More information

Schema Refinement and Normal Forms Chapter 19

Schema Refinement and Normal Forms Chapter 19 Schema Refinement and Normal Forms Chapter 19 Instructor: Vladimir Zadorozhny vladimir@sis.pitt.edu Information Science Program School of Information Sciences, University of Pittsburgh Database Management

More information

1 Estimating Frequency Moments in Streams

1 Estimating Frequency Moments in Streams CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature

More information

Factorized Relational Databases Olteanu and Závodný, University of Oxford

Factorized Relational Databases   Olteanu and Závodný, University of Oxford November 8, 2013 Database Seminar, U Washington Factorized Relational Databases http://www.cs.ox.ac.uk/projects/fd/ Olteanu and Závodný, University of Oxford Factorized Representations of Relations Cust

More information

The Count-Min-Sketch and its Applications

The Count-Min-Sketch and its Applications The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local

More information

Introduction. Normalization. Example. Redundancy. What problems are caused by redundancy? What are functional dependencies?

Introduction. Normalization. Example. Redundancy. What problems are caused by redundancy? What are functional dependencies? Normalization Introduction What problems are caused by redundancy? UVic C SC 370 Dr. Daniel M. German Department of Computer Science What are functional dependencies? What are normal forms? What are the

More information

Lecture 10 September 27, 2016

Lecture 10 September 27, 2016 CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 10 September 27, 2016 Scribes: Quinten McNamara & William Hoza 1 Overview In this lecture, we focus on constructing coresets, which are

More information

End-biased Samples for Join Cardinality Estimation

End-biased Samples for Join Cardinality Estimation End-biased Samples for Join Cardinality Estimation Cristian Estan, Jeffrey F. Naughton Computer Sciences Department University of Wisconsin-Madison estan,naughton@cs.wisc.edu Abstract We present a new

More information

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18

CSE 417T: Introduction to Machine Learning. Final Review. Henry Chai 12/4/18 CSE 417T: Introduction to Machine Learning Final Review Henry Chai 12/4/18 Overfitting Overfitting is fitting the training data more than is warranted Fitting noise rather than signal 2 Estimating! "#$

More information

Submodularity in Machine Learning

Submodularity in Machine Learning Saifuddin Syed MLRG Summer 2016 1 / 39 What are submodular functions Outline 1 What are submodular functions Motivation Submodularity and Concavity Examples 2 Properties of submodular functions Submodularity

More information

Sketching and Streaming for Distributions

Sketching and Streaming for Distributions Sketching and Streaming for Distributions Piotr Indyk Andrew McGregor Massachusetts Institute of Technology University of California, San Diego Main Material: Stable distributions, pseudo-random generators,

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

Bias Correction in Classification Tree Construction ICML 2001

Bias Correction in Classification Tree Construction ICML 2001 Bias Correction in Classification Tree Construction ICML 21 Alin Dobra Johannes Gehrke Department of Computer Science Cornell University December 15, 21 Classification Tree Construction Outlook Temp. Humidity

More information

Effective computation of biased quantiles over data streams

Effective computation of biased quantiles over data streams Effective computation of biased quantiles over data streams Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com

More information

Biased Quantiles. Flip Korn Graham Cormode S. Muthukrishnan

Biased Quantiles. Flip Korn Graham Cormode S. Muthukrishnan Biased Quantiles Graham Cormode cormode@bell-labs.com S. Muthukrishnan muthu@cs.rutgers.edu Flip Korn flip@research.att.com Divesh Srivastava divesh@research.att.com Quantiles Quantiles summarize data

More information

CS 188: Artificial Intelligence Spring Announcements

CS 188: Artificial Intelligence Spring Announcements CS 188: Artificial Intelligence Spring 2011 Lecture 16: Bayes Nets IV Inference 3/28/2011 Pieter Abbeel UC Berkeley Many slides over this course adapted from Dan Klein, Stuart Russell, Andrew Moore Announcements

More information

approximation algorithms I

approximation algorithms I SUM-OF-SQUARES method and approximation algorithms I David Steurer Cornell Cargese Workshop, 201 meta-task encoded as low-degree polynomial in R x example: f(x) = i,j n w ij x i x j 2 given: functions

More information

How to Optimally Allocate Resources for Coded Distributed Computing?

How to Optimally Allocate Resources for Coded Distributed Computing? 1 How to Optimally Allocate Resources for Coded Distributed Computing? Qian Yu, Songze Li, Mohammad Ali Maddah-Ali, and A. Salman Avestimehr Department of Electrical Engineering, University of Southern

More information

Homework 3. Convex Optimization /36-725

Homework 3. Convex Optimization /36-725 Homework 3 Convex Optimization 10-725/36-725 Due Friday October 14 at 5:30pm submitted to Christoph Dann in Gates 8013 (Remember to a submit separate writeup for each problem, with your name at the top)

More information

The Evils of Redundancy. Schema Refinement and Normalization. Functional Dependencies (FDs) Example: Constraints on Entity Set. Refining an ER Diagram

The Evils of Redundancy. Schema Refinement and Normalization. Functional Dependencies (FDs) Example: Constraints on Entity Set. Refining an ER Diagram The Evils of Redundancy Schema Refinement and Normalization Chapter 1 Nobody realizes that some people expend tremendous energy merely to be normal. Albert Camus Redundancy is at the root of several problems

More information

CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES

CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES CORRELATION-AWARE STATISTICAL METHODS FOR SAMPLING-BASED GROUP BY ESTIMATES By FEI XU A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

11 Heavy Hitters Streaming Majority

11 Heavy Hitters Streaming Majority 11 Heavy Hitters A core mining problem is to find items that occur more than one would expect. These may be called outliers, anomalies, or other terms. Statistical models can be layered on top of or underneath

More information

[Title removed for anonymity]

[Title removed for anonymity] [Title removed for anonymity] Graham Cormode graham@research.att.com Magda Procopiuc(AT&T) Divesh Srivastava(AT&T) Thanh Tran (UMass Amherst) 1 Introduction Privacy is a common theme in public discourse

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store

More information

Analysis of Algorithms I: Perfect Hashing

Analysis of Algorithms I: Perfect Hashing Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the

More information

0 1 d 010. h 0111 i g 01100

0 1 d 010. h 0111 i g 01100 CMSC 5:Fall 07 Dave Mount Solutions to Homework : Greedy Algorithms Solution : The code tree is shown in Fig.. a 0000 c 0000 b 00 f 000 d 00 g 000 h 0 i 00 e Prob / / /8 /6 / Depth 5 Figure : Solution

More information

Summarizing Measured Data

Summarizing Measured Data Summarizing Measured Data 12-1 Overview Basic Probability and Statistics Concepts: CDF, PDF, PMF, Mean, Variance, CoV, Normal Distribution Summarizing Data by a Single Number: Mean, Median, and Mode, Arithmetic,

More information

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory

StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory StreamSVM Linear SVMs and Logistic Regression When Data Does Not Fit In Memory S.V. N. (vishy) Vishwanathan Purdue University and Microsoft vishy@purdue.edu October 9, 2012 S.V. N. Vishwanathan (Purdue,

More information

Linear & nonlinear classifiers

Linear & nonlinear classifiers Linear & nonlinear classifiers Machine Learning Hamid Beigy Sharif University of Technology Fall 1396 Hamid Beigy (Sharif University of Technology) Linear & nonlinear classifiers Fall 1396 1 / 44 Table

More information

P P P NP-Hard: L is NP-hard if for all L NP, L L. Thus, if we could solve L in polynomial. Cook's Theorem and Reductions

P P P NP-Hard: L is NP-hard if for all L NP, L L. Thus, if we could solve L in polynomial. Cook's Theorem and Reductions Summary of the previous lecture Recall that we mentioned the following topics: P: is the set of decision problems (or languages) that are solvable in polynomial time. NP: is the set of decision problems

More information

Quiz 2. Due November 26th, CS525 - Advanced Database Organization Solutions

Quiz 2. Due November 26th, CS525 - Advanced Database Organization Solutions Name CWID Quiz 2 Due November 26th, 2015 CS525 - Advanced Database Organization s Please leave this empty! 1 2 3 4 5 6 7 Sum Instructions Multiple choice questions are graded in the following way: You

More information

Integer Linear Programs

Integer Linear Programs Lecture 2: Review, Linear Programming Relaxations Today we will talk about expressing combinatorial problems as mathematical programs, specifically Integer Linear Programs (ILPs). We then see what happens

More information

Lecture #7 (Relational Design Theory, cont d.)

Lecture #7 (Relational Design Theory, cont d.) Introduction to Data Management Lecture #7 (Relational Design Theory, cont d.) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Announcements

More information

1 Maintaining a Dictionary

1 Maintaining a Dictionary 15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition

More information

Lecture 3 Sept. 4, 2014

Lecture 3 Sept. 4, 2014 CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.

More information

A Pass-Efficient Algorithm for Clustering Census Data

A Pass-Efficient Algorithm for Clustering Census Data A Pass-Efficient Algorithm for Clustering Census Data Kevin Chang Yale University Ravi Kannan Yale University Abstract We present a number of streaming algorithms for a basic clustering problem for massive

More information

CS246 Final Exam, Winter 2011

CS246 Final Exam, Winter 2011 CS246 Final Exam, Winter 2011 1. Your name and student ID. Name:... Student ID:... 2. I agree to comply with Stanford Honor Code. Signature:... 3. There should be 17 numbered pages in this exam (including

More information

Lecture 01 August 31, 2017

Lecture 01 August 31, 2017 Sketching Algorithms for Big Data Fall 2017 Prof. Jelani Nelson Lecture 01 August 31, 2017 Scribe: Vinh-Kha Le 1 Overview In this lecture, we overviewed the six main topics covered in the course, reviewed

More information

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions.

Lecture 6. Today we shall use graph entropy to improve the obvious lower bound on good hash functions. CSE533: Information Theory in Computer Science September 8, 010 Lecturer: Anup Rao Lecture 6 Scribe: Lukas Svec 1 A lower bound for perfect hash functions Today we shall use graph entropy to improve the

More information

STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS

STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS STATISTICAL APPROXIMATIONS OF DATABASE QUERIES WITH CONFIDENCE INTERVALS By LIXIA CHEN A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS

More information

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of

More information

On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation

On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation On the complexity of maximizing the minimum Shannon capacity in wireless networks by joint channel assignment and power allocation Mikael Fallgren Royal Institute of Technology December, 2009 Abstract

More information

Random Sampling on Big Data: Techniques and Applications Ke Yi

Random Sampling on Big Data: Techniques and Applications Ke Yi : Techniques and Applications Ke Yi Hong Kong University of Science and Technology yike@ust.hk Big Data in one slide The 3 V s: Volume Velocity Variety Integers, real numbers Points in a multi-dimensional

More information

Time-Decayed Correlated Aggregates over Data Streams

Time-Decayed Correlated Aggregates over Data Streams Time-Decayed Correlated Aggregates over Data Streams Graham Cormode AT&T Labs Research graham@research.att.com Srikanta Tirthapura Bojian Xu ECE Dept., Iowa State University {snt,bojianxu}@iastate.edu

More information

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick,

Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Title: Count-Min Sketch Name: Graham Cormode 1 Affil./Addr. Department of Computer Science, University of Warwick, Coventry, UK Keywords: streaming algorithms; frequent items; approximate counting, sketch

More information

Stephen Scott.

Stephen Scott. 1 / 35 (Adapted from Ethem Alpaydin and Tom Mitchell) sscott@cse.unl.edu In Homework 1, you are (supposedly) 1 Choosing a data set 2 Extracting a test set of size > 30 3 Building a tree on the training

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Lecture 2 Sept. 8, 2015

Lecture 2 Sept. 8, 2015 CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],

More information

Wavelet decomposition of data streams. by Dragana Veljkovic

Wavelet decomposition of data streams. by Dragana Veljkovic Wavelet decomposition of data streams by Dragana Veljkovic Motivation Continuous data streams arise naturally in: telecommunication and internet traffic retail and banking transactions web server log records

More information

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) [R&G] Chapter 19

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) [R&G] Chapter 19 Schema Refinement and Normal Forms [R&G] Chapter 19 CS432 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational schemas: redundant storage, insert/delete/update

More information

Linear Congruences. The equation ax = b for a, b R is uniquely solvable if a 0: x = b/a. Want to extend to the linear congruence:

Linear Congruences. The equation ax = b for a, b R is uniquely solvable if a 0: x = b/a. Want to extend to the linear congruence: Linear Congruences The equation ax = b for a, b R is uniquely solvable if a 0: x = b/a. Want to extend to the linear congruence: ax b (mod m), a, b Z, m N +. (1) If x 0 is a solution then so is x k :=

More information

Announcements. CS 188: Artificial Intelligence Spring Bayes Net Semantics. Probabilities in BNs. All Conditional Independences

Announcements. CS 188: Artificial Intelligence Spring Bayes Net Semantics. Probabilities in BNs. All Conditional Independences CS 188: Artificial Intelligence Spring 2011 Announcements Assignments W4 out today --- this is your last written!! Any assignments you have not picked up yet In bin in 283 Soda [same room as for submission

More information

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics Impression Store: Compressive Sensing-based Storage for Big Data Analytics Jiaxing Zhang, Ying Yan, Liang Jeff Chen, Minjie Wang, Thomas Moscibroda & Zheng Zhang Microsoft Research The Curse of O(N) in

More information

CS570 Introduction to Data Mining

CS570 Introduction to Data Mining CS570 Introduction to Data Mining Department of Mathematics and Computer Science Li Xiong Data Exploration and Data Preprocessing Data and Attributes Data exploration Data pre-processing Data cleaning

More information

Chapter 7: Relational Database Design

Chapter 7: Relational Database Design Chapter 7: Relational Database Design Chapter 7: Relational Database Design! First Normal Form! Pitfalls in Relational Database Design! Functional Dependencies! Decomposition! Boyce-Codd Normal Form! Third

More information

Approximate Query Processing Using Wavelets

Approximate Query Processing Using Wavelets Approximate Query Processing Using Wavelets Kaushik Chakrabarti Minos Garofalakis Rajeev Rastogi Kyuseok Shim Presented by Guanghua Yan Outline Approximate query processing: Problem and Prior solutions

More information

The Evils of Redundancy. Schema Refinement and Normal Forms. Functional Dependencies (FDs) Example: Constraints on Entity Set. Example (Contd.

The Evils of Redundancy. Schema Refinement and Normal Forms. Functional Dependencies (FDs) Example: Constraints on Entity Set. Example (Contd. The Evils of Redundancy Schema Refinement and Normal Forms INFO 330, Fall 2006 1 Redundancy is at the root of several problems associated with relational schemas: redundant storage, insert/delete/update

More information

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria 12. LOCAL SEARCH gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley h ttp://www.cs.princeton.edu/~wayne/kleinberg-tardos

More information

Privacy-Preserving Data Mining

Privacy-Preserving Data Mining CS 380S Privacy-Preserving Data Mining Vitaly Shmatikov slide 1 Reading Assignment Evfimievski, Gehrke, Srikant. Limiting Privacy Breaches in Privacy-Preserving Data Mining (PODS 2003). Blum, Dwork, McSherry,

More information

Sequential: Vector of Bits

Sequential: Vector of Bits Counting the Number of Accesses Sequential: Vector of Bits When estimating seek costs, we need to calculate the probability distribution for the distance between two subsequent qualifying cylinders. We

More information

CMPUT651: Differential Privacy

CMPUT651: Differential Privacy CMPUT65: Differential Privacy Homework assignment # 2 Due date: Apr. 3rd, 208 Discussion and the exchange of ideas are essential to doing academic work. For assignments in this course, you are encouraged

More information

Bayes Nets III: Inference

Bayes Nets III: Inference 1 Hal Daumé III (me@hal3.name) Bayes Nets III: Inference Hal Daumé III Computer Science University of Maryland me@hal3.name CS 421: Introduction to Artificial Intelligence 10 Apr 2012 Many slides courtesy

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 4: Query Optimization Query Optimization Cost estimation Strategies for exploring plans Q min CS 347 Notes 4 2 Cost Estimation Based on

More information

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) CIS 330, Spring 2004 Lecture 11 March 2, 2004

Schema Refinement and Normal Forms. The Evils of Redundancy. Functional Dependencies (FDs) CIS 330, Spring 2004 Lecture 11 March 2, 2004 Schema Refinement and Normal Forms CIS 330, Spring 2004 Lecture 11 March 2, 2004 1 The Evils of Redundancy Redundancy is at the root of several problems associated with relational schemas: redundant storage,

More information

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang. Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)

More information

On the Optimality of the Greedy Heuristic in Wavelet Synopses for Range Queries

On the Optimality of the Greedy Heuristic in Wavelet Synopses for Range Queries On the Optimality of the Greedy Heuristic in Wavelet Synopses for Range Queries Yossi Matias School of Computer Science Tel Aviv University Tel Aviv 69978, Israel matias@tau.ac.il Daniel Urieli School

More information

Behavioral Simulations in MapReduce

Behavioral Simulations in MapReduce Behavioral Simulations in MapReduce Guozhang Wang, Marcos Vaz Salles, Benjamin Sowell, Xun Wang, Tuan Cao, Alan Demers, Johannes Gehrke, Walker White Cornell University 1 What are Behavioral Simulations?

More information

Data Analytics Beyond OLAP. Prof. Yanlei Diao

Data Analytics Beyond OLAP. Prof. Yanlei Diao Data Analytics Beyond OLAP Prof. Yanlei Diao OPERATIONAL DBs DB 1 DB 2 DB 3 EXTRACT TRANSFORM LOAD (ETL) METADATA STORE DATA WAREHOUSE SUPPORTS OLAP DATA MINING INTERACTIVE DATA EXPLORATION Overview of

More information

Constraints: Functional Dependencies

Constraints: Functional Dependencies Constraints: Functional Dependencies Spring 2018 School of Computer Science University of Waterloo Databases CS348 (University of Waterloo) Functional Dependencies 1 / 32 Schema Design When we get a relational

More information

Randomized Load Balancing:The Power of 2 Choices

Randomized Load Balancing:The Power of 2 Choices Randomized Load Balancing: The Power of 2 Choices June 3, 2010 Balls and Bins Problem We have m balls that are thrown into n bins, the location of each ball chosen independently and uniformly at random

More information