Sampling Everything Data CompSci 290.01 Spring 2014
2 Announcements (Thu. Mar 26) Homework #11 will be posted by noon tomorrow.
3 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling
4 Sampling Using a fraction of the available data to make inferences about the whole dataset Why? Dataset is very large Cost associated with data acquisition is high Computing the answer on the entire dataset is time consuming
5 Population Population: a dataset D with N records Population Statistics Mean: E.g., average weight of newborns in the US Count: E.g., Number of votes cast for a presidential candidate Proportion: E.g., Fraction of pages on the Web that are spam
6 Simple random sample (SRS) SRS(D, n) is a sample of n records where each record is in the sample with equal probability P(record in sample) =? # possible samples =?
7 Simple random sample (SRS) SRS(D, n) is a sample of n records where each record is in the sample with equal probability P(record in sample) = n/N # possible samples = C(N, n) = N! / ((N−n)! n!)
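A minimal sketch of SRS in Python (the helper name `srs` is mine, not from the slides): `random.sample` draws without replacement, so every size-n subset is equally likely, which gives P(record in sample) = n/N.

```python
import math
import random

def srs(records, n, seed=0):
    """Simple random sample of n records without replacement.
    Every size-n subset is equally likely. (hypothetical helper)"""
    rng = random.Random(seed)
    return rng.sample(records, n)

population = list(range(10))          # N = 10
sample = srs(population, 3)           # n = 3
assert len(set(sample)) == 3          # n distinct records
assert all(x in population for x in sample)
assert math.comb(10, 3) == 120        # C(N, n) possible samples
```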
8 Population vs Sample Will a statistic (mean/count/proportion) computed on the sample be the same as that computed on the population? Not necessarily
9 Population vs Sample Proportion of even numbers in the set of all natural numbers is 0.5 Sample may contain only odd numbers. But the probability of getting such a sample is tiny. P(sample of n numbers is all odd) = 2^(−n)
10 Sampling Distribution Let μ be a population statistic. Let μ_S be the same statistic computed on the sample. Sampling Distribution: Probability distribution of μ_S obtained by considering all possible samples of size n.
11 Expected Sample Statistic For μ = mean or proportion: E[μ_S] = μ For μ = count: E[μ_S] = nμ/N
12 But what about your sample? We would like to say: μ_S for a large (1−δ) fraction of all samples is within ε of μ Additive (aka confidence interval): P[|μ_S − μ| < ε] > 1−δ Multiplicative: P[(1−ε)μ < μ_S < (1+ε)μ] > 1−δ
13 Confidence Intervals Let μ = population mean Let σ = population standard deviation Let σ_S = standard deviation of μ_S (aka standard error) σ_S = σ/√n
14 Confidence Intervals By central limit theorem, sampling distribution is close to normal distribution (for n > 25) http://www.unc.edu/~hakan/econ70/lec05.pdf
15 Confidence Intervals σ_S = σ/√n + sampling distribution ≈ Normal ⇒ For at least 95% of samples, |μ_S − μ| < 2σ/√n when n > 25
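The 2σ/√n claim can be checked empirically. A small simulation sketch (the synthetic population and all parameters here are made up for illustration): draw many samples and count how often the sample mean lands within 2σ/√n of the population mean; it should be roughly 95% of the time.

```python
import random
import statistics

rng = random.Random(42)
# synthetic population: 100,000 values ~ Normal(50, 10)
population = [rng.gauss(50, 10) for _ in range(100_000)]
mu = statistics.mean(population)        # population mean
sigma = statistics.pstdev(population)   # population standard deviation

n, trials = 400, 200
half_width = 2 * sigma / n ** 0.5       # the 2σ/√n interval half-width
hits = 0
for _ in range(trials):
    mu_s = statistics.mean(rng.sample(population, n))
    if abs(mu_s - mu) < half_width:
        hits += 1

# roughly 95% of samples should fall inside the interval
assert hits / trials > 0.90
```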
16 Confidence intervals for proportions p: proportion in the population p_S: proportion in the sample At least 95% of all samples have |p_S − p| < 2√(p(1−p)/n) when np > 5 and n(1−p) > 5
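The same empirical check works for proportions, using the standard error √(p(1−p)/n) (again a sketch with made-up numbers, chosen so that np > 5 and n(1−p) > 5):

```python
import random

rng = random.Random(7)
p = 0.3
# synthetic 0/1 population with proportion p of ones
population = [1] * 30_000 + [0] * 70_000

n, trials = 500, 200                          # np = 150, n(1-p) = 350
half_width = 2 * (p * (1 - p) / n) ** 0.5     # 2 * standard error
hits = 0
for _ in range(trials):
    p_s = sum(rng.sample(population, n)) / n  # sample proportion
    if abs(p_s - p) < half_width:
        hits += 1

assert hits / trials > 0.90                   # ~95% coverage expected
```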
17 Summary of Simple Random Sampling All of statistics involves using a sample to learn properties of population Sampling distribution helps connect the sample statistic to population statistic In expectation, mean/proportion of sample equals mean/proportion of population Confidence intervals help us decide number of samples needed.
18 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling
19 Counting problems Triangle counting: Find the number of triplets of vertices (a, b, c) such that (a,b), (b,c) and (c,a) are edges. Advertising contracts: Need 1 million impressions that satisfy [Male, 15-35, CA] OR [Male, 5-25, TX] Historically, were a million such individuals seen?
20 Counting can be hard Triangle counting (N = #nodes, M = #edges, d_max = max degree): Naïve method: O(N³) Look at every triple (a,b,c) and check for Δ Best known methods: O(d_max² N) or O(M^1.5) Not efficient for large graphs Twitter 2009: N = 54,981,152, M = 1,963,263,821, d_max > 3 million
21 Sampling to the rescue Suppose S is the set whose size we want to estimate Let U be some universe such that S is a subset of U
22 Monte Carlo Method For i = 1 to n Choose y from U uniformly at random Check whether y is in S Let Xi = 1 if y is in S, and 0 otherwise Return: Ĉ = |U| · (X1 + … + Xn)/n Stop when the estimated count converges
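The loop above can be sketched generically (the helper `monte_carlo_count` and the toy universe are mine, not from the slides): estimate |S| as |U| times the fraction of uniform samples that land in S.

```python
import random

def monte_carlo_count(universe_size, sample_point, in_S, n, seed=0):
    """Estimate |S| as |U| * (fraction of uniform samples landing in S).
    sample_point: draws a uniform element of U; in_S: membership test.
    (hypothetical helper, illustrating the slide's loop)"""
    rng = random.Random(seed)
    hits = sum(in_S(sample_point(rng)) for _ in range(n))
    return universe_size * hits / n

# toy example: count multiples of 3 in U = {0, ..., 9999}; true count is 3334
U_SIZE = 10_000
est = monte_carlo_count(U_SIZE,
                        lambda rng: rng.randrange(U_SIZE),
                        lambda y: y % 3 == 0,
                        n=20_000)
assert abs(est - 3334) < 300   # within a few percent of the true count
```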
23 When to use Monte Carlo Method Easy to uniformly sample from U Easy to check whether a sample is in S Estimate Ĉ converges with a small number of samples: P[(1−ε)|S| < Ĉ < (1+ε)|S|] > 1−δ
24 Triangle Counting U = set of all triples, |U| = N³ # samples needed for convergence: n > (|U|/|S|) · (3/ε²) · ln(2/δ) Number of Δs, |S|, can be much smaller than |U| = N³, so n can be huge. No better than the naïve algorithm.
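The sample-size bound makes the problem concrete. A quick calculator sketch (assuming the Chernoff-style bound on this slide) shows how the bound explodes when |S| ≪ |U|:

```python
import math

def samples_needed(U_size, S_size, eps, delta):
    """Chernoff-style bound n > (|U|/|S|) * (3/eps^2) * ln(2/delta) for a
    (1±eps) multiplicative estimate holding with probability 1-delta.
    (sketch of the bound quoted on the slide)"""
    return (U_size / S_size) * (3 / eps ** 2) * math.log(2 / delta)

# if triangles are rare (|S| ~ N while |U| = N^3), the bound exceeds N^2:
N = 10_000
bound = samples_needed(N ** 3, N, eps=0.1, delta=0.05)
assert bound > N ** 2
```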
25 Triangle Counting |U| = |T0| + |T1| + |T2| + |T3| Ti: triples that have exactly i edges amongst them S = T3 n > ((|T0| + |T1| + |T2| + |T3|)/|T3|) · (3/ε²) · ln(2/δ) Most triples are in T0. Can we avoid sampling any triple in T0?
26 Biased sampling Sample an edge (a,b) uniformly at random Sample a node c ≠ a, b uniformly at random Universe does not contain triples from T0 Universe contains all triples from T1 Universe contains all triples from T2, and they are twice as likely Universe contains all triples from T3, and they are three times as likely
27 Triangle counting 2.0 Sample an edge (a,b) uniformly at random Sample a node c ≠ a, b uniformly at random |U| = |T1| + 2|T2| + 3|T3| S = T3 n > ((|T1| + 2|T2| + 3|T3|)/(3|T3|)) · (3/ε²) · ln(2/δ)
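A sketch of this edge-based sampler (helper names are mine; assumes an undirected simple graph): sample an edge (a,b), sample a third node c, test whether the two closing edges exist, and rescale by M(N−2)/3, since the effective universe has M(N−2) (edge, node) pairs and each triangle is reached via its 3 edges.

```python
import random
from itertools import combinations

def estimate_triangles(nodes, edges, n, seed=0):
    """Biased Monte Carlo triangle count: sample an edge (a,b), then a node
    c not in {a,b}; the triple is a triangle iff (a,c) and (b,c) are edges.
    Each triangle is hit via its 3 edges, hence the division by 3."""
    rng = random.Random(seed)
    edge_set = {frozenset(e) for e in edges}
    edge_list = list(edge_set)
    hits = 0
    for _ in range(n):
        a, b = rng.choice(edge_list)
        c = rng.choice([v for v in nodes if v not in (a, b)])
        if frozenset((a, c)) in edge_set and frozenset((b, c)) in edge_set:
            hits += 1
    # |U| = M * (N - 2); rescale the hit fraction and divide by 3
    return hits / n * len(edge_list) * (len(nodes) - 2) / 3

# sanity check on a tiny graph: K4 has exactly C(4,3) = 4 triangles
nodes = [0, 1, 2, 3]
edges = list(combinations(nodes, 2))   # complete graph K4
est = estimate_triangles(nodes, edges, n=5_000)
assert abs(est - 4) < 1.0
```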
28 Summary of Counting Counting exactly can be time consuming Monte Carlo method: pick samples from a universe containing the set of interest, and approximate the count using samples. Simple random sample may need many samples. Biased sampling can help.
29 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling
30 Problem Given an input graph G = (V, E), construct a subgraph H = (V′, E′), where V′ ⊆ V and E′ ⊆ E. Want H to have the same properties as G
31 Properties of interest Density of the graph (# edges/ #nodes) Degree distribution Distance distribution Clustering coefficients distribution
32 How to Sample? On edges? On nodes?
33 Edge Sampling For every edge in the original graph, retain it with some probability p.
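A one-line sketch of edge sampling (helper name mine): keep each edge independently with probability p, so E[|E′|] = p·|E|.

```python
import random

def edge_sample(edges, p, seed=0):
    """Retain each edge independently with probability p (hypothetical helper)."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() < p]

# toy graph: complete graph on 100 nodes, |E| = 4950
edges = [(i, j) for i in range(100) for j in range(i + 1, 100)]
kept = edge_sample(edges, p=0.1)
# expected |E'| = 0.1 * 4950 = 495; the realized count fluctuates around it
assert 300 < len(kept) < 700
```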
34 Edge Sampling Resulting graph is extremely sparse. Diameter is very large Clustering coefficient is very small Nodes with high degree in the original graph also have high degree in the sampled graph. Not a random sample on nodes Degree distribution is not preserved
35 Node Sampling Choose V′ as a simple random sample of V Induced subgraph: E′ = set of edges (a,b) in E such that both a and b are in V′
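A sketch of node sampling with the induced subgraph (helper name mine): take an SRS of the nodes, then keep exactly the edges whose endpoints both survive.

```python
import random

def node_sample(nodes, edges, n, seed=0):
    """SRS of n nodes plus the induced subgraph on them (hypothetical helper)."""
    rng = random.Random(seed)
    kept = set(rng.sample(nodes, n))
    induced = [(a, b) for (a, b) in edges if a in kept and b in kept]
    return kept, induced

# toy graph: a 50-node cycle
nodes = list(range(50))
edges = [(i, (i + 1) % 50) for i in range(50)]
v_prime, e_prime = node_sample(nodes, edges, n=10)
assert len(v_prime) == 10
assert all(a in v_prime and b in v_prime for a, b in e_prime)  # induced
```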
36 Node Sampling Degree distribution looks similar Resulting graph is sparser compared to the original graph Clustering coefficient is much smaller When nodes are sampled uniformly at random, the sample is unlikely to include all three nodes of any triangle.
37 Random Walk Sampling Pick a starting node uniformly at random Perform a random walk With probability (1- c), choose an outgoing edge uniformly at random With probability c, jump to starting node Stop when enough number of nodes are visited, and compute the induced subgraph
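The walk above can be sketched as follows (helper names mine; the graph is treated as undirected for simplicity, and c = 0.15 is an arbitrary restart probability):

```python
import random
from collections import defaultdict

def random_walk_sample(nodes, edges, visits, c=0.15, seed=0):
    """Random walk with restart: with probability c jump back to the start
    node, otherwise follow a uniformly chosen incident edge. Stops once
    `visits` distinct nodes are seen; returns them with their induced
    subgraph. (sketch of the slide's procedure)"""
    rng = random.Random(seed)
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    start = rng.choice(nodes)
    cur, seen = start, {start}
    while len(seen) < visits:
        if rng.random() < c or not adj[cur]:
            cur = start                    # restart at the starting node
        else:
            cur = rng.choice(adj[cur])     # uniform step along an edge
        seen.add(cur)
    induced = [(a, b) for a, b in edges if a in seen and b in seen]
    return seen, induced

# toy graph: a connected 30-node cycle
nodes = list(range(30))
edges = [(i, (i + 1) % 30) for i in range(30)]
seen, sub = random_walk_sample(nodes, edges, visits=10)
assert len(seen) == 10
assert all(a in seen and b in seen for a, b in sub)
```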
38 Random Walk Sampling Is also biased toward high-degree nodes So the degree distribution will not look very similar Preserves the clustering coefficient
39 Random Jump sampling Pick a starting node uniformly at random Perform a random walk with jumps With probability (1−c), choose an outgoing edge uniformly at random With probability c, jump to an arbitrary node To avoid getting stuck in dead-ends in directed graphs.
40 Summary of Graph Sampling Graph sampling is not like sampling from tables Hard to preserve all statistics using a single sample. Node sampling preserves the shape of the degree distribution Random walk sampling preserves the clustering in the graph.