Monotone Estimation Framework and Applications for Scalable Analytics of Large Data Sets

1 Monotone Estimation Framework and Applications for Scalable Analytics of Large Data Sets
Edith Cohen (Google Research, Tel Aviv University)

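2 A Monotone Sampling Scheme
Data domain $V$ (vectors over $\mathbb{R}_+$), data $v \in V$, random seed $u \sim U[0,1]$.
Outcome $S(v,u)$: a function of the data $v$ and the seed $u$.
The seed value $u$ is available with the outcome.
$S(v,u)$ can be interpreted as the set of all data vectors consistent with the outcome and $u$.
Monotonicity: fixing $v$, the information in $S(v,u)$ is non-increasing with $u$ (the set of consistent data vectors can only grow as $u$ increases).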

3 Monotone Estimation Problem (MEP)
A monotone sampling scheme $(V, S)$:
  Data domain $V$ (vectors over $\mathbb{R}_+$)
  Sampling scheme $S$ defined on $V \times [0,1]$
A nonnegative function $f: V \to \mathbb{R}_{\ge 0}$
Goal: estimate $f(v)$ from the sample.
[Diagram: data $v$, then sample $S$, then estimator $\hat f(S(v,u),u)$, answering the query "$f(v)$?".]
Desired properties of the estimator $\hat f(S)$:
  Unbiased: $\forall v,\ \int_0^1 \hat f(S(v,x),x)\,dx = f(v)$ (useful with sums of MEPs).
  Nonnegative: keeps the estimate $\hat f$ in the same domain as $f$.
  (Pareto) optimal (admissible): any estimator with smaller $\mathrm{var}_{u \sim U[0,1]}\,\hat f(S(v,u))$ for some $v$ has larger variance for some other data vector.

4 Bounds on f(v) from S and u
Fix the data $v$. The lower the seed $u$ is, the more we know about $v$ and hence about $f(v)$.
[Figure: information on $f(v)$ as a function of the seed $u \in [0,1]$.]

5 Estimators for MEPs
Desired: unbiased, nonnegative, bounded variance.
Admissible: Pareto optimal in terms of variance.
Results preview:
  Explicit expressions for estimators for any MEP for which such an estimator exists.
  The solution is not unique. We consider some estimators with natural properties.
  If we require monotonicity ($\hat f(S,u)$ is non-increasing with $u$), we get uniqueness.
  A notion of competitiveness of estimators.
We will come back to this, but first we look at some applications.

6 MEP applications in data analysis
Scalable computation of approximate statistics and queries over large data sets: data is sampled (composable, distributed scheme), and the sample is used to estimate statistics/queries expressed as a sum of multiple MEPs.
Key-value pairs with multiple sets of values (instances): take coordinated samples of instances. We get a MEP for each key.
Sketching graph-based influence and similarity functions: distance-sketch the utility values (relations of a node to all others). We get a MEP for each target node from sketches of seed nodes.
Sketching generalized coverage functions: coordinated weighted sample of the utility vector of each element. We get a MEP for each item.

7 Social/Communication data
An activity value $v(x)$ is associated with each key $x = (b, c)$ (e.g., number of messages or amount of communication from $b$ to $c$).
Take a weighted sample of keys, for example bottom-k ("weighted reservoir") or PPS (Probability Proportional to Size).
Monday activity: (a,b) 40, (f,g) 5, (h,c) 20, (a,z) 10, (h,f) 10, (f,s) 10
PPS: for $\tau > 0$ and i.i.d. $u(x) \sim U[0,1]$: $x \in S \iff v(x) \ge \tau\, u(x)$.
Monday sample: (a,b) 40, (a,z) 10, .., (f,s) 10
With bottom-k, $\tau$ is set to obtain a fixed sample size $k$.
Without-replacement sampling: $x \in S \iff v(x) \ge -\tau \ln u(x)$.
Fully composable sampling scheme.
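The threshold rule on this slide can be sketched in a few lines. This is a minimal illustration, not the author's code: the function names, the threshold `tau=40`, and the fixed random seed are assumptions chosen only to make the example reproducible.

```python
import random

def pps_sample(values, tau, seeds):
    """Keep key x when values[x] >= tau * seeds[x] (threshold PPS sampling)."""
    return {x: v for x, v in values.items() if v >= tau * seeds[x]}

def inclusion_probability(value, tau):
    """Pr[value >= tau * u] for u ~ U[0,1]."""
    return min(1.0, value / tau)

# Monday activity values from the slide.
monday = {("a", "b"): 40, ("f", "g"): 5, ("h", "c"): 20,
          ("a", "z"): 10, ("h", "f"): 10, ("f", "s"): 10}

rng = random.Random(0)                     # fixed seed only for reproducibility
seeds = {x: rng.random() for x in monday}  # one seed u(x) per key
sample = pps_sample(monday, tau=40, seeds=seeds)  # tau=40 is a hypothetical choice
print(sample)
```

With `tau=40` the key (a,b) has inclusion probability 1 and is always kept, while a key with value 10 is kept with probability 0.25.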

8 Samples of multiple days
Coordinated samples: values differ across days, but each key is sampled with the same seed $u(x)$ on every day.
Monday activity: (a,b) 40, (f,g) 5, (h,c) 20, (a,z) 10, (h,f) 10, (f,s) 10
Monday sample: (a,b) 40, (a,z) 10, .., (f,s) 10
Tuesday activity: (a,b) 3, (f,g) 5, (g,c) 10, (a,z) 50, (s,f) 20, (g,h) 10
Tuesday sample: (g,c) 10, (a,z) 50, .., (g,h) 10
Wednesday activity: (a,b) 30, (g,c) 5, (h,c) 10, (a,z) 10, (b,f) 20, (d,h) 10
Wednesday sample: (a,b) 30, (b,f) 20, .., (d,h) 10
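One common way to obtain the same seed $u(x)$ for a key on every day is to derive it from a hash of the key. The sketch below assumes that approach; `seed_of`, the salt string, and `tau=50` are hypothetical and not taken from the slides.

```python
import hashlib

def seed_of(key, salt="demo"):
    """Map a key to a reproducible pseudo-random seed in [0, 1)."""
    digest = hashlib.sha256(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def coordinated_pps(instances, tau):
    """PPS-sample every instance, reusing the same seed u(x) for each key x."""
    return {day: {x: v for x, v in values.items() if v >= tau * seed_of(x)}
            for day, values in instances.items()}

# Activity values from the slide.
instances = {
    "Monday": {("a", "b"): 40, ("f", "g"): 5, ("h", "c"): 20,
               ("a", "z"): 10, ("h", "f"): 10, ("f", "s"): 10},
    "Tuesday": {("a", "b"): 3, ("f", "g"): 5, ("g", "c"): 10,
                ("a", "z"): 50, ("s", "f"): 20, ("g", "h"): 10},
    "Wednesday": {("a", "b"): 30, ("g", "c"): 5, ("h", "c"): 10,
                  ("a", "z"): 10, ("b", "f"): 20, ("d", "h"): 10},
}
samples = coordinated_pps(instances, tau=50)  # tau=50 is a hypothetical choice
for day, sample in samples.items():
    print(day, sample)
```

Because the seed is shared, a key that is sampled on a day where its value is small is also sampled on every day where its value is at least as large.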

9 Matrix view: keys × instances
In our example, keys $x = (a, b)$ are user pairs and instances are days.
[Table: rows are the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h); columns are the days Su, Mo, Tu, We, Th, Fr, Sa.]

10 Example Statistics
Specify a segment of the keys $Y \subseteq X$. Examples: one user in CA and one in NY; Apple device to Android device.
Queries/statistics have the form $\sum_{x \in Y} f(v_1(x), v_2(x), \ldots, v_r(x))$:
  Total communication of the segment on Wednesday: $\sum_{x \in Y} v_1(x)$ (taking instance 1 to be Wednesday)
  $L_p$ distance / weighted Jaccard change in activity of the segment between Friday and Saturday: $\sum_{x \in Y} |v_1(x) - v_2(x)|^p$
  $L_p^p$ increase/decrease: $\sum_{x \in Y} \max\{0, v_1(x) - v_2(x)\}^p$
  Coverage of segment $Y$ in days $D$: $\sum_{x \in Y} \max_{i \in D} v_i(x)$
  Average/sum of median/max/min/top-3/concave aggregate of activity values over days $D$
We would like to compute an estimate from the sample.
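For reference, the exact (unsampled) versions of these statistics are straightforward to compute when the full data is available. The sketch below uses Monday and Tuesday values from the earlier slides; the helper names and the two-key segment are assumptions made for illustration.

```python
def lpp_distance(v1, v2, segment, p=1):
    """Sum over the segment of |v1(x) - v2(x)|^p (missing keys count as 0)."""
    return sum(abs(v1.get(x, 0) - v2.get(x, 0)) ** p for x in segment)

def lpp_increase(v1, v2, segment, p=1):
    """Sum over the segment of max(0, v1(x) - v2(x))^p."""
    return sum(max(0, v1.get(x, 0) - v2.get(x, 0)) ** p for x in segment)

def coverage(instances, segment):
    """Sum over the segment of the maximum value across the given instances."""
    return sum(max(inst.get(x, 0) for inst in instances) for x in segment)

monday = {("a", "b"): 40, ("f", "g"): 5, ("h", "c"): 20,
          ("a", "z"): 10, ("h", "f"): 10, ("f", "s"): 10}
tuesday = {("a", "b"): 3, ("f", "g"): 5, ("g", "c"): 10,
           ("a", "z"): 50, ("s", "f"): 20, ("g", "h"): 10}
segment = [("a", "b"), ("a", "z")]  # hypothetical segment Y

print(lpp_distance(monday, tuesday, segment))   # |40-3| + |10-50| = 77
print(lpp_increase(monday, tuesday, segment))   # 37 + 0 = 37
print(coverage([monday, tuesday], segment))     # 40 + 50 = 90
```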

11 Matrix view: keys × instances
Coordinated PPS sample, $\tau = 100$ for all entries.
[Table: columns are the seed $u$ and the days Su, Mo, Tu, We, Th, Fr, Sa; rows are the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]

12 Estimate sum statistics, one key at a time
$\sum_{x \in Y} f(v(x))$: the sum over keys $x \in Y$ of $f(v(x))$, where $v(x) = (v_1(x), v_2(x))$.
For $L_p$ distance: $f(v) = |v_1 - v_2|^p$.
Estimate one key at a time: $\sum_{x \in Y} \hat f(S(v(x)))$.
The estimator for $f(v)$ is applied to the sample of $v$.

13 Easy statistics: Sum over entries
Estimate a single entry at a time.
Example: total communication of segment $Y$ on Monday.
Inverse-probability estimate (Horvitz-Thompson) [HT52]: the sum over sampled $x \in Y$ of $v_{\text{monday}}(x) / p_{\text{monday}}(x)$.
The inclusion probability $p_{\text{monday}}(x)$ can be computed from $v(x)$ and $\tau$: since $x \in S \iff v(x) \ge \tau\, u(x)$, we have $p_i(x) = \Pr_{u \sim U[0,1]}[v_i(x) \ge \tau_i\, u(x)]$.
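Because $u(x)$ is uniform on $[0,1]$, the inclusion probability under the threshold PPS rule has a simple closed form. The worked instance uses the value 43 with $\tau = 100$:

```latex
p_i(x) = \Pr_{u \sim U[0,1]}\left[ v_i(x) \ge \tau_i u \right]
       = \min\left\{ 1, \frac{v_i(x)}{\tau_i} \right\},
\qquad \text{e.g. } v_i(x) = 43,\ \tau_i = 100 \ \Rightarrow\ p_i(x) = 0.43 .
```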

14 HT estimator (single-instance)
Coordinated PPS sample, $\tau = 100$.
[Table: columns are the seed $u$ and the days Su, Mo, Tu, We, Th, Fr, Sa; rows are the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]

15 HT estimator (single-instance)
$\tau = 100$. Day: Wednesday. Segment: CA-NY.
[Table: columns are the seed $u$ and the days Su, Mo, Tu, We, Th, Fr, Sa; rows are the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]

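16 HT estimator for single-instance
$\tau = 100$. Day: Wednesday. Segment: CA-NY.
Wednesday values: (a,b) 43, (g,c) 0, (h,c) 60, (a,z) 24, (h,f) 3, (f,s) 20, (d,h) 0
Inclusion probabilities of the sampled segment keys: $p = 0.43$ and $p = 0.20$.
Exact: $43 + 60 + 20 = 123$
The HT estimate is 0 for keys that are not sampled and $v/p$ when the key is sampled.
HT estimate: $43/0.43 + 20/0.20 = 200$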
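The numbers on this slide can be reproduced with a short sketch. The Wednesday values and $\tau = 100$ come from the slide; the seed values and the segment membership are hypothetical, chosen so that the keys with values 43 and 20 are sampled and the key with value 60 is not.

```python
def ht_estimate(values, seeds, tau, segment):
    """Horvitz-Thompson estimate of the segment sum under threshold PPS sampling."""
    total = 0.0
    for x in segment:
        v = values.get(x, 0)
        if v > 0 and v >= tau * seeds[x]:   # key x is sampled
            p = min(1.0, v / tau)           # inclusion probability
            total += v / p                  # inverse-probability weight
    return total

wednesday = {("a", "b"): 43, ("g", "c"): 0, ("h", "c"): 60, ("a", "z"): 24,
             ("h", "f"): 3, ("f", "s"): 20, ("d", "h"): 0}
segment = [("a", "b"), ("h", "c"), ("f", "s")]   # assumed CA-NY segment
seeds = {("a", "b"): 0.30, ("h", "c"): 0.75, ("f", "s"): 0.10}  # hypothetical seeds

exact = sum(wednesday[x] for x in segment)
estimate = ht_estimate(wednesday, seeds, tau=100, segment=segment)
print(exact, round(estimate, 6))
```

A single sample can be far from the exact value (here 200 versus 123); unbiasedness is a statement about the average over seeds.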

17 Inverse-Probability (HT) estimator
✓ Unbiased: $(1 - p(x)) \cdot 0 + p(x) \cdot \frac{f(v(x))}{p(x)} = f(v(x))$
✓ Nonnegative: $f(v(x)) \ge 0$, so $\frac{f(v(x))}{p(x)} \ge 0$
✓ Bounded variance (for all $v$)
✓ Monotone: more information gives a higher estimate
✓ Optimal: UMVU, the unique minimum variance (unbiased, nonnegative, sum) estimator
Works when $f$ depends on a single entry. What about general $f$?
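For a single key with inclusion probability $p > 0$ and value $f = f(v(x))$, the expectation and variance of the inverse-probability estimate follow directly from its two possible outcomes ($0$ with probability $1 - p$, $f/p$ with probability $p$):

```latex
\mathrm{E}[\hat f] = (1 - p) \cdot 0 + p \cdot \frac{f}{p} = f,
\qquad
\mathrm{Var}[\hat f] = p \left( \frac{f}{p} \right)^{2} - f^{2}
                     = f^{2} \left( \frac{1}{p} - 1 \right).
```

The variance is finite whenever $p > 0$ and vanishes when $p = 1$.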

18 Queries involving multiple columns
$L_p^p$ distance: $f(v) = |v_1 - v_2|^p$
$L_p^p$ increase: $f(v) = \max\{0, v_1 - v_2\}^p$
The HT estimate is positive only when we know $f(v) = |v_1 - v_2|$ from the sample. But when $v_2 = 0$ and $v_1 > 0$, we have $f(v) > 0$ while the sample never reveals $f(v)$, because the second entry is never sampled. Thus, HT is biased.
Even when unbiased, HT may not be optimal. E.g., when $v_1$ is sampled and we can deduce from $\tau_2$ and $u$ that $v_2 \le a < v_1$, then we know that $f(v) \ge v_1 - a$. An optimal estimator will use this incomplete information.
We want estimators with the same nice properties as HT, plus optimality.
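The bias can be checked numerically. The sketch below assumes a coordinated scheme in which entry $i$ is revealed when $v_i > \tau u$ (with $\tau = 1$), and an HT-style estimate that is positive only when both entries are revealed; the function names and the grid size are illustrative choices.

```python
def ht_two_entry(v1, v2, u, tau=1.0):
    """HT-style estimate of |v1 - v2|: positive only if both entries are sampled."""
    both_sampled = v1 > tau * u and v2 > tau * u
    if not both_sampled:
        return 0.0
    p = min(1.0, min(v1, v2) / tau)   # probability that both entries are sampled
    return abs(v1 - v2) / p

def expected(estimator, v1, v2, n=100_000):
    """Midpoint-rule average of the estimate over seeds u in (0, 1)."""
    return sum(estimator(v1, v2, (i + 0.5) / n) for i in range(n)) / n

print(expected(ht_two_entry, 0.6, 0.2))  # close to f = 0.4: unbiased here
print(expected(ht_two_entry, 0.6, 0.0))  # 0.0, although f = 0.6: biased
```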

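19 Sampled data
Coordinated PPS sample, $\tau = 100$.
[Table: columns are the seed $u$ and the days Su, Mo, Tu, We, Th, Fr, Sa; rows are the keys (a,b), (g,c), (h,c), (a,z), (h,f), (f,s), (d,h).]
We want to estimate a sum of squared differences. Let's look at key (a,z) and at estimating its squared difference.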

20 Information on f
Fix the data $v$. The lower $u$ is, the more we know about $v$ and about $f(v) = (v_1 - v_2)^2 = 81$.
We plot the lower bound we have on $f(v)$ as a function of the seed $u$.
[Figure: lower bound on $f(v)$ versus the seed $u$.]

21 This is a MEP! (Monotone Estimation Problem)
A monotone sampling scheme $(V, S)$:
  Data domain $V$; here $(v_1, v_2) \in \mathbb{R}_{\ge 0}^2$.
  Sampling scheme $S$ defined on $V \times [0,1]$; here $S((v_1, v_2), u)$ reveals $v_i$ when $v_i > 100u$.
A nonnegative function $f: V \to \mathbb{R}_{\ge 0}$; here $f(v) = (v_1 - v_2)^2$.
Goal: estimate $f(v)$, that is, specify an estimator $\hat f(S, u)$ that is unbiased, nonnegative, has bounded variance, and is admissible (optimal).
The solution is not unique.
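The lower bound on $f(v) = (v_1 - v_2)^2$ that an outcome provides can be written down directly: a hidden entry is known to be at most $100u$. The sketch below is an illustration under that reading; the data vector (24, 15), chosen so that $f(v) = 81$, is hypothetical.

```python
def lower_bound(v1, v2, u, tau=100.0):
    """Lower bound on (v1 - v2)^2 given the outcome for seed u.

    Entry i is revealed when v_i > tau * u; a hidden entry is only known
    to lie in [0, tau * u].
    """
    t = tau * u
    known = [v for v in (v1, v2) if v > t]
    if len(known) == 2:
        return (v1 - v2) ** 2          # f(v) is fully determined
    if len(known) == 1:
        return (known[0] - t) ** 2     # hidden entry could be as large as t
    return 0.0                         # both hidden: the entries may be equal

for u in (0.05, 0.10, 0.20, 0.30):
    print(u, lower_bound(24, 15, u))   # hypothetical data with f(v) = 81
```

The bound equals $f(v)$ for small seeds and decreases to 0 as $u$ grows, which is the monotone behavior the MEP formulation relies on.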

22 The optimal (admissible) range
[Figure: cumulative estimate for $x > u$.]
We see $S(v, u)$ and $u$. We know what $S(v, x)$ is for all $x > u$.
Suppose we fixed $M = \int_u^1 \hat f(S(v, x), x)\,dx$.

23 MEP Estimators
Order-optimal estimators: for an order $\prec$ on the data domain $V$, any estimator with lower variance on $v$ must have higher variance on some $z \prec v$.
The L* estimator:
  The unique admissible monotone estimator.
  Order optimal for: $z \prec v \iff f(z) < f(v)$.
  4-variance competitive (defined on the next slide).
The U* estimator:
  Order optimal for: $z \prec v \iff f(z) > f(v)$.
The choice of estimator depends on the properties we want, possibly depending on the typical data distribution. L* is a good default (monotone and competitive).

24 Variance Competitiveness [CK13]
A worst-case (over data) theoretical indicator of estimator quality.
For each $v$, we can consider the minimum $E_{u \sim U[0,1]}[\hat f^2(S(v, u), u)]$ attainable by an estimator that is unbiased and nonnegative for all other $v$.
We use such an optimal estimator $\hat f^{(v)}$ for $v$ as a reference point.
An estimator $\hat f(S, u)$ is $c$-competitive if, for any data $v$, the expectation of the square is within a factor $c$ of the minimum possible for $v$ (by an unbiased and nonnegative estimator):
for all unbiased nonnegative $g$, $E_{u \sim U[0,1]}[\hat f^2(S(v, u))] \le c\, E_{u \sim U[0,1]}[g^2(S(v, u))]$.
The L* estimator is 4-competitive and this is tight: for some MEPs, the ratio is 4.

25 Optimal estimator $\hat f^{(v)}$ for data $v$ (unbiased and nonnegative for all data)
The optimal estimates $\hat f^{(v)}$ are the negated derivative of the lower hull of the lower bound function.
[Figure: the lower bound function for $v$ and its lower hull, plotted against the seed $u \in [0,1]$.]
Intuition: the lower bound tells us, for an outcome $S$, how high we can go with the estimate in order to optimize variance for $v$ while still being nonnegative on all other consistent data vectors.
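The construction can be sketched numerically: tabulate the lower bound function on a grid of seeds, take its lower convex hull, and read off negated slopes. The example below assumes $f(v) = |v_1 - v_2|$, the data vector $(0.6, 0.2)$, and a scheme that reveals $v_i$ when $v_i > u$; the grid resolution and all function names are illustrative.

```python
def lower_bound(v1, v2, u):
    """Lower bound on |v1 - v2| when entry i is revealed iff v_i > u."""
    known = [v for v in (v1, v2) if v > u]
    if len(known) == 2:
        return abs(v1 - v2)
    if len(known) == 1:
        return known[0] - u        # the hidden entry is at most u
    return 0.0

def cross(o, a, b):
    """Cross product of vectors o->a and o->b (positive for a left turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def lower_hull(points):
    """Lower convex hull of points sorted by x (monotone chain)."""
    hull = []
    for p in points:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) <= 0:
            hull.pop()
        hull.append(p)
    return hull

def v_optimal_estimate(hull, u):
    """Negated slope of the hull segment that contains seed u."""
    for (x0, y0), (x1, y1) in zip(hull, hull[1:]):
        if x0 <= u <= x1:
            return (y0 - y1) / (x1 - x0)
    return 0.0

n = 1000
points = [(i / n, lower_bound(0.6, 0.2, i / n)) for i in range(n + 1)]
hull = lower_hull(points)
print(hull)
print(v_optimal_estimate(hull, 0.3), v_optimal_estimate(hull, 0.8))
```

For this data the hull is the segment from $(0, 0.4)$ to $(0.6, 0)$ followed by the zero line, so the estimate is $2/3$ for $u < 0.6$ and $0$ afterwards, which integrates to $f(v) = 0.4$.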

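26 The L* estimator
We see $S(v, u)$ and $u$. We know what $S(v, x)$ is for all $x > u$.
$\hat f^{(L)}(S(v, u), u) = \frac{\underline{f}(S(v, u), u)}{u} - \int_u^1 \frac{\underline{f}(S(v, x), x)}{x^2}\,dx$, where $\underline{f}(S, u)$ is the lower bound on $f$ over all data vectors consistent with $S$ and $u$.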
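The L* expression can be evaluated numerically for a concrete case. The sketch below assumes $f(v) = |v_1 - v_2|$, the data vector $(0.6, 0.2)$, and a scheme that reveals $v_i$ when $v_i > u$; the integral is approximated with a midpoint rule, and the step count is an arbitrary choice. The lower bound is computed from $v$ for brevity, which is equivalent here because the bound at any $x > u$ is determined by the outcome at $u$.

```python
import math

def lower_bound(v1, v2, u):
    """Lower bound on |v1 - v2| when entry i is revealed iff v_i > u."""
    known = [v for v in (v1, v2) if v > u]
    if len(known) == 2:
        return abs(v1 - v2)
    if len(known) == 1:
        return known[0] - u        # the hidden entry is at most u
    return 0.0

def l_star(v1, v2, u, steps=20_000):
    """L* estimate: LB(u)/u minus the integral of LB(x)/x^2 over [u, 1]."""
    h = (1.0 - u) / steps
    integral = 0.0
    for i in range(steps):
        x = u + (i + 0.5) * h      # midpoint of the i-th subinterval
        integral += lower_bound(v1, v2, x) / (x * x)
    integral *= h
    return lower_bound(v1, v2, u) / u - integral

for u in (0.1, 0.4, 0.8):
    print(u, l_star(0.6, 0.2, u))
print(math.log(3), math.log(1.5))  # closed-form values for u = 0.1 and u = 0.4
```

For this example the integral can also be done by hand: the estimate is $\ln 3$ for $u \le 0.2$, $\ln(0.6/u)$ for $0.2 < u \le 0.6$, and $0$ for $u > 0.6$, which integrates over $u \in [0,1]$ to $f(v) = 0.4$.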

27 $L_1$ estimation example
Estimators for $f(v_1, v_2) = |v_1 - v_2|$.
Scheme: $v_i \ge 0$ is sampled if $v_i > u$.
[Figure: the lower bound (LB) on $f(0.6, 0.2)$ from $S$ and $u$; the lower hull of LB; the L*, U*, and $v$-optimal estimators.]
U* is optimized for the vector $(0.6, 0.0)$ (always consistent with $S$).
L* is optimized locally for the vector $(0.6, u)$ (the consistent vector with the smallest $f$).

28 $L_2^2$ estimation example
Estimators for $f(v_1, v_2) = |v_1 - v_2|^2$.
Scheme: $v_i \ge 0$ is sampled if $v_i > u$.
[Figure: the lower bound (LB) on $f(0.6, 0.2)$ from $S$ and $u$; the lower hull of LB; the L*, U*, and $v$-optimal estimators.]
U* is optimized for the vector $(0.6, 0.0)$ (always consistent with $S$).
L* is optimized locally for the vector $(0.6, u)$ (the consistent vector with the smallest $f$).

29 Summary
Defined Monotone Estimation Problems (MEPs), motivated by coordinated sampling.
Derived Pareto optimal (admissible) unbiased and nonnegative estimators (for any MEP when they exist):
  L* (lower end of the range: the unique monotone estimator, dominates HT)
  U* (upper end of the range)
  Order-optimal estimators (optimized for certain data patterns)

30 Applications
Estimators for Euclidean and Manhattan distances from samples [C KDD 14]
Sketch-based closeness similarity in social networks [CDFGGW COSN 13] (similarity of the sets of long-range interactions)
Sketching generalized coverage functions, including graph-based influence functions [CDPW 14, C 16]

31 Future
Tighter bounds on the universal ratio: L* is 4-competitive; a better ratio is achievable; the lower bound is 1.44.
Instance-optimal competitiveness: give an efficient construction for any MEP.
Multi-dimensional MEPs: multiple independent seeds (independent samples of "columns"). There are some initial derivations for $d = 2$ and for coverage and distance functions [CK 12, C 14], but the full picture is missing.

32 $L_1$ distance [C KDD14]
Independent / coordinated PPS sampling
Data: number of IP flows to a destination in two time periods

33 $L_2^2$ distance [C KDD14]
Independent / coordinated PPS sampling
Data: surname occurrences in 2007 and 2008 books (Google ngrams)

35 Coordination of samples
A very powerful tool for big data analysis, with applications well beyond what [Brewer, Early, Joyce 1972] could envision:
Locality Sensitive Hashing (LSH): similar weight vectors have similar samples/sketches.
Multi-objective samples (universal samples): a single sample (as small as possible) that provides statistical guarantees for multiple sets of weights.
Statistics/domain queries that span multiple instances (Jaccard similarity, $L_p$ distances, distinct counts, union size, ...).
MinHash sketches are a special case with 0/1 weights.
Facilitates faster computation of samples. Example: [C 97] sketching/sampling reachability sets and neighborhoods of all nodes in a graph in near-linear time.
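The 0/1-weight special case can be illustrated with bottom-k MinHash sketches, where coordination comes from hashing every item with the same function. This is a generic textbook-style sketch rather than anything shown in the slides; the hash construction, the salt, and the sketch size are assumptions.

```python
import hashlib

def item_hash(item, salt="demo"):
    """Shared hash of an item into [0, 1); sharing it coordinates the sketches."""
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def bottom_k(items, k):
    """Bottom-k sketch: the k smallest hash values of the set."""
    return sorted(item_hash(x) for x in set(items))[:k]

def jaccard_estimate(sketch_a, sketch_b, k):
    """Fraction of the k smallest hashes of the union that occur in both sketches."""
    union_k = sorted(set(sketch_a) | set(sketch_b))[:k]
    if not union_k:
        return 0.0
    both = set(sketch_a) & set(sketch_b)
    return sum(1 for h in union_k if h in both) / len(union_k)

k = 64
a = bottom_k(range(0, 1000), k)
b = bottom_k(range(500, 1500), k)   # true Jaccard similarity is 1/3
print(jaccard_estimate(a, b, k))
```

The estimate fluctuates around the true similarity; larger `k` tightens it at the cost of a larger sketch.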


COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

Bayesian Methods: Naïve Bayes

Bayesian Methods: Naïve Bayes Bayesian Methods: aïve Bayes icholas Ruozzi University of Texas at Dallas based on the slides of Vibhav Gogate Last Time Parameter learning Learning the parameter of a simple coin flipping model Prior

More information

Presentation for CSG399 By Jian Wen

Presentation for CSG399 By Jian Wen Presentation for CSG399 By Jian Wen Outline Skyline Definition and properties Related problems Progressive Algorithms Analysis SFS Analysis Algorithm Cost Estimate Epsilon Skyline(overview and idea) Skyline

More information

CMU Lecture 4: Informed Search. Teacher: Gianni A. Di Caro

CMU Lecture 4: Informed Search. Teacher: Gianni A. Di Caro CMU 15-781 Lecture 4: Informed Search Teacher: Gianni A. Di Caro UNINFORMED VS. INFORMED Uninformed Can only generate successors and distinguish goals from non-goals Informed Strategies that can distinguish

More information

Homework 6: Solutions Sid Banerjee Problem 1: (The Flajolet-Martin Counter) ORIE 4520: Stochastics at Scale Fall 2015

Homework 6: Solutions Sid Banerjee Problem 1: (The Flajolet-Martin Counter) ORIE 4520: Stochastics at Scale Fall 2015 Problem 1: (The Flajolet-Martin Counter) In class (and in the prelim!), we looked at an idealized algorithm for finding the number of distinct elements in a stream, where we sampled uniform random variables

More information

Chapter 3: Element sampling design: Part 1

Chapter 3: Element sampling design: Part 1 Chapter 3: Element sampling design: Part 1 Jae-Kwang Kim Fall, 2014 Simple random sampling 1 Simple random sampling 2 SRS with replacement 3 Systematic sampling Kim Ch. 3: Element sampling design: Part

More information

LECTURE NOTES 57. Lecture 9

LECTURE NOTES 57. Lecture 9 LECTURE NOTES 57 Lecture 9 17. Hypothesis testing A special type of decision problem is hypothesis testing. We partition the parameter space into H [ A with H \ A = ;. Wewrite H 2 H A 2 A. A decision problem

More information

Quilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs

Quilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs Quilting Stochastic Kronecker Graphs to Generate Multiplicative Attribute Graphs Hyokun Yun (work with S.V.N. Vishwanathan) Department of Statistics Purdue Machine Learning Seminar November 9, 2011 Overview

More information

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday!

Case Study 1: Estimating Click Probabilities. Kakade Announcements: Project Proposals: due this Friday! Case Study 1: Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 4, 017 1 Announcements:

More information

A tailor made nonparametric density estimate

A tailor made nonparametric density estimate A tailor made nonparametric density estimate Daniel Carando 1, Ricardo Fraiman 2 and Pablo Groisman 1 1 Universidad de Buenos Aires 2 Universidad de San Andrés School and Workshop on Probability Theory

More information

Support weight enumerators and coset weight distributions of isodual codes

Support weight enumerators and coset weight distributions of isodual codes Support weight enumerators and coset weight distributions of isodual codes Olgica Milenkovic Department of Electrical and Computer Engineering University of Colorado, Boulder March 31, 2003 Abstract In

More information

University of New Mexico Department of Computer Science. Final Examination. CS 561 Data Structures and Algorithms Fall, 2013

University of New Mexico Department of Computer Science. Final Examination. CS 561 Data Structures and Algorithms Fall, 2013 University of New Mexico Department of Computer Science Final Examination CS 561 Data Structures and Algorithms Fall, 2013 Name: Email: This exam lasts 2 hours. It is closed book and closed notes wing

More information

LESSON 2 5 CHAPTER 2 OBJECTIVES

LESSON 2 5 CHAPTER 2 OBJECTIVES LESSON 2 5 CHAPTER 2 OBJECTIVES POSTULATE a statement that describes a fundamental relationship between the basic terms of geometry. THEOREM a statement that can be proved true. PROOF a logical argument

More information

Applications of Differentiation

Applications of Differentiation Applications of Differentiation Definitions. A function f has an absolute maximum (or global maximum) at c if for all x in the domain D of f, f(c) f(x). The number f(c) is called the maximum value of f

More information

Sketching as a Tool for Numerical Linear Algebra

Sketching as a Tool for Numerical Linear Algebra Sketching as a Tool for Numerical Linear Algebra David P. Woodruff presented by Sepehr Assadi o(n) Big Data Reading Group University of Pennsylvania February, 2015 Sepehr Assadi (Penn) Sketching for Numerical

More information

Efficient Approximation for Restricted Biclique Cover Problems

Efficient Approximation for Restricted Biclique Cover Problems algorithms Article Efficient Approximation for Restricted Biclique Cover Problems Alessandro Epasto 1, *, and Eli Upfal 2 ID 1 Google Research, New York, NY 10011, USA 2 Department of Computer Science,

More information

The Rocket Car. UBC Math 403 Lecture Notes by Philip D. Loewen

The Rocket Car. UBC Math 403 Lecture Notes by Philip D. Loewen The Rocket Car UBC Math 403 Lecture Notes by Philip D. Loewen We consider this simple system with control constraints: ẋ 1 = x, ẋ = u, u [ 1, 1], Dynamics. Consider the system evolution on a time interval

More information

4 Nonlinear Equations

4 Nonlinear Equations 4 Nonlinear Equations lecture 27: Introduction to Nonlinear Equations lecture 28: Bracketing Algorithms The solution of nonlinear equations has been a motivating challenge throughout the history of numerical

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

ASM Study Manual for Exam P, First Edition By Dr. Krzysztof M. Ostaszewski, FSA, CFA, MAAA Errata

ASM Study Manual for Exam P, First Edition By Dr. Krzysztof M. Ostaszewski, FSA, CFA, MAAA Errata ASM Study Manual for Exam P, First Edition By Dr. Krzysztof M. Ostaszewski, FSA, CFA, MAAA (krzysio@krzysio.net) Errata Effective July 5, 3, only the latest edition of this manual will have its errata

More information

Math Models of OR: Some Definitions

Math Models of OR: Some Definitions Math Models of OR: Some Definitions John E. Mitchell Department of Mathematical Sciences RPI, Troy, NY 12180 USA September 2018 Mitchell Some Definitions 1 / 20 Active constraints Outline 1 Active constraints

More information

Transformation of Quasiconvex Functions to Eliminate. Local Minima. JOTA manuscript No. (will be inserted by the editor) Suliman Al-Homidan Nicolas

Transformation of Quasiconvex Functions to Eliminate. Local Minima. JOTA manuscript No. (will be inserted by the editor) Suliman Al-Homidan Nicolas JOTA manuscript No. (will be inserted by the editor) Transformation of Quasiconvex Functions to Eliminate Local Minima Suliman Al-Homidan Nicolas Hadjisavvas Loai Shaalan Communicated by Constantin Zalinescu

More information

f x x, where f x (E) f, where ln

f x x, where f x (E) f, where ln AB Review 08 Calculator Permitted (unless stated otherwise) 1. h0 ln e h 1 lim is h (A) f e, where f ln (B) f e, where f (C) f 1, where ln (D) f 1, where f ln e ln 0 (E) f, where ln f f 1 1, where t is

More information

Source Coding and Function Computation: Optimal Rate in Zero-Error and Vanishing Zero-Error Regime

Source Coding and Function Computation: Optimal Rate in Zero-Error and Vanishing Zero-Error Regime Source Coding and Function Computation: Optimal Rate in Zero-Error and Vanishing Zero-Error Regime Solmaz Torabi Dept. of Electrical and Computer Engineering Drexel University st669@drexel.edu Advisor:

More information

Internal link prediction: a new approach for predicting links in bipartite graphs

Internal link prediction: a new approach for predicting links in bipartite graphs Internal link prediction: a new approach for predicting links in bipartite graphs Oussama llali, lémence Magnien and Matthieu Latapy LIP6 NRS and Université Pierre et Marie urie (UPM Paris 6) 4 place Jussieu

More information

Exact and Approximate Equilibria for Optimal Group Network Formation

Exact and Approximate Equilibria for Optimal Group Network Formation Noname manuscript No. will be inserted by the editor) Exact and Approximate Equilibria for Optimal Group Network Formation Elliot Anshelevich Bugra Caskurlu Received: December 2009 / Accepted: Abstract

More information

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties

ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties ELEC6910Q Analytics and Systems for Social Media and Big Data Applications Lecture 3 Centrality, Similarity, and Strength Ties Prof. James She james.she@ust.hk 1 Last lecture 2 Selected works from Tutorial

More information

Lecture 4: UMVUE and unbiased estimators of 0

Lecture 4: UMVUE and unbiased estimators of 0 Lecture 4: UMVUE and unbiased estimators of Problems of approach 1. Approach 1 has the following two main shortcomings. Even if Theorem 7.3.9 or its extension is applicable, there is no guarantee that

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved.

Chapter 11. Approximation Algorithms. Slides by Kevin Wayne Pearson-Addison Wesley. All rights reserved. Chapter 11 Approximation Algorithms Slides by Kevin Wayne. Copyright @ 2005 Pearson-Addison Wesley. All rights reserved. 1 Approximation Algorithms Q. Suppose I need to solve an NP-hard problem. What should

More information

CSE 291: Fourier analysis Chapter 2: Social choice theory

CSE 291: Fourier analysis Chapter 2: Social choice theory CSE 91: Fourier analysis Chapter : Social choice theory 1 Basic definitions We can view a boolean function f : { 1, 1} n { 1, 1} as a means to aggregate votes in a -outcome election. Common examples are:

More information

McGILL UNIVERSITY FACULTY OF SCIENCE FINAL EXAMINATION MATHEMATICS A CALCULUS I EXAMINER: Professor K. K. Tam DATE: December 11, 1998 ASSOCIATE

McGILL UNIVERSITY FACULTY OF SCIENCE FINAL EXAMINATION MATHEMATICS A CALCULUS I EXAMINER: Professor K. K. Tam DATE: December 11, 1998 ASSOCIATE NOTE TO PRINTER (These instructions are for the printer. They should not be duplicated.) This examination should be printed on 8 1 2 14 paper, and stapled with 3 side staples, so that it opens like a long

More information

I will post Homework 1 soon, probably over the weekend, due Friday, September 30.

I will post Homework 1 soon, probably over the weekend, due Friday, September 30. Random Variables Friday, September 09, 2011 2:02 PM I will post Homework 1 soon, probably over the weekend, due Friday, September 30. No class or office hours next week. Next class is on Tuesday, September

More information

Frequency Estimators

Frequency Estimators Frequency Estimators Outline for Today Randomized Data Structures Our next approach to improving performance. Count-Min Sketches A simple and powerful data structure for estimating frequencies. Count Sketches

More information

CS 5522: Artificial Intelligence II

CS 5522: Artificial Intelligence II CS 5522: Artificial Intelligence II Bayes Nets: Independence Instructor: Alan Ritter Ohio State University [These slides were adapted from CS188 Intro to AI at UC Berkeley. All materials available at http://ai.berkeley.edu.]

More information

ch 3 applications of differentiation notebook.notebook January 17, 2018 Extrema on an Interval

ch 3 applications of differentiation notebook.notebook January 17, 2018 Extrema on an Interval Extrema on an Interval Extrema, or extreme values, are the minimum and maximum of a function. They are also called absolute minimum and absolute maximum (or global max and global min). Extrema that occur

More information

On the Scalability in Cooperative Control. Zhongkui Li. Peking University

On the Scalability in Cooperative Control. Zhongkui Li. Peking University On the Scalability in Cooperative Control Zhongkui Li Email: zhongkli@pku.edu.cn Peking University June 25, 2016 Zhongkui Li (PKU) Scalability June 25, 2016 1 / 28 Background Cooperative control is to

More information