Sampling Everything Data CompSci 290.01 Spring 2014

2 Announcements (Thu. Mar 26) Homework #11 will be posted by noon tomorrow.

3 Outline Simple Random Sampling Means & Proportions Importance Sampling Counting Δs in a graph Graph sampling

4 Sampling Using a fraction of the available data to make inferences about the whole dataset Why? Dataset is very large Cost associated with data acquisition is high Computing the answer on the entire dataset is time consuming

5 Population Population: a dataset D with N records Population Statistics Mean: E.g., average weight of newborns in the US Count: E.g., Number of votes cast for a presidential candidate Proportion: E.g., Fraction of pages on the Web that are spam

6 Simple random sample (SRS) SRS(D, n) is a sample of n records where each record is in the sample with equal probability P(record in sample) =? # possible samples =?

7 Simple random sample (SRS) SRS(D, n) is a sample of n records where each record is in the sample with equal probability P(record in sample) = n/N # possible samples = C(N, n) = N! / ((N − n)! n!)
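As a minimal sketch (assuming the records fit in memory), Python's standard `random.sample` draws exactly this kind of sample: every size-n subset is equally likely, so each record lands in the sample with probability n/N.

```python
import random

def srs(records, n):
    """Simple random sample: every size-n subset of the records is
    equally likely, so P(record in sample) = n/N."""
    return random.sample(records, n)

# Example: sample 3 of 10 records.
print(srs(list(range(10)), 3))
```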

8 Population vs Sample Will a statistic (mean/count/proportion) computed on the sample be the same as that computed on the population? Not necessarily

9 Population vs Sample The proportion of even numbers among all natural numbers is 0.5 A sample may contain only odd numbers, but the probability of getting such a sample is tiny: P(sample of n numbers are all odd) = 2^(−n)

10 Sampling Distribution Let μ be a population statistic. Let μ_S be the same statistic on the sample. Sampling distribution: the probability distribution of μ_S obtained by considering all possible samples of size n.

11 Expected Sample Statistic For μ = mean or proportion: E[μ_S] = μ For μ = count: E[μ_S] = nμ/N
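A quick simulation (on made-up data, not from the slides) illustrates both expectations: averaged over many samples the sample mean matches the population mean, and scaling a sample count by N/n recovers the population count.

```python
import random
import statistics

random.seed(0)
population = [random.gauss(50, 10) for _ in range(10_000)]
N, n = len(population), 100

# E[mu_S] = mu for the mean: average the sample mean over many samples.
means = [statistics.mean(random.sample(population, n)) for _ in range(500)]
print(statistics.mean(means), statistics.mean(population))  # nearly equal

# E[count_S] = n * count / N, so (N/n) * count_S estimates the count.
count = sum(1 for x in population if x > 60)
count_s = sum(1 for x in random.sample(population, n) if x > 60)
print(count_s * N / n, count)  # roughly equal
```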

12 But what about your sample? We would like to say: μ_S for a large (1 − δ) fraction of all samples is within ε of μ Additive (aka confidence interval): P[|μ_S − μ| < ε] > 1 − δ Multiplicative: P[(1 − ε)μ < μ_S < (1 + ε)μ] > 1 − δ

13 Confidence Intervals Let μ = population mean Let σ = population standard deviation Let σ_S = standard deviation of μ_S (aka standard error) σ_S = σ/√n

14 Confidence Intervals By the central limit theorem, the sampling distribution is close to a normal distribution (for n > 25) http://www.unc.edu/~hakan/econ70/lec05.pdf

15 Confidence Intervals σ_S = σ/√n together with a near-normal sampling distribution gives: for at least 95% of samples, |μ_S − μ| < 2σ/√n when n > 25
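A small sketch of the resulting 95% interval. One caveat beyond the slide's formula: σ is rarely known, so the sample standard deviation is substituted for it here, which is a standard move for n this large.

```python
import math
import random
import statistics

def mean_ci95(sample):
    """95% confidence interval for the mean: mu_S +/- 2*sigma/sqrt(n),
    substituting the sample standard deviation for sigma."""
    n = len(sample)
    se = statistics.stdev(sample) / math.sqrt(n)
    m = statistics.mean(sample)
    return m - 2 * se, m + 2 * se

print(mean_ci95([random.gauss(50, 10) for _ in range(100)]))
```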

16 Confidence intervals for proportions p: proportion in the population p_S: proportion in the sample At least 95% of all samples have |p_S − p| < 2√(p(1 − p)/n) when np > 5 and n(1 − p) > 5
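The proportion case is the same recipe with standard error √(p(1 − p)/n). This sketch also checks the np > 5 and n(1 − p) > 5 rule of thumb, using the sample proportion in place of the unknown p (a standard substitution, not stated on the slide).

```python
import math

def proportion_ci95(successes, n):
    """95% confidence interval for a proportion:
    p_S +/- 2*sqrt(p_S*(1-p_S)/n), valid when np > 5 and n(1-p) > 5."""
    p_s = successes / n
    if n * p_s <= 5 or n * (1 - p_s) <= 5:
        raise ValueError("normal approximation not justified")
    se = math.sqrt(p_s * (1 - p_s) / n)
    return p_s - 2 * se, p_s + 2 * se

print(proportion_ci95(420, 1000))  # e.g., 420 spam pages in a 1000-page sample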

17 Summary of Simple Random Sampling All of statistics involves using a sample to learn properties of a population The sampling distribution connects the sample statistic to the population statistic In expectation, the mean/proportion of the sample equals the mean/proportion of the population Confidence intervals help us decide the number of samples needed

18 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling

19 Counting problems Triangle counting: Find the number of triplets of vertices (a, b, c) such that (a,b), (b,c) and (c,a) are edges. Advertising contracts: Need 1 million impressions that satisfy [Male, 15-35, CA] OR [Male, 5-25, TX] Historically, were a million such individuals seen?

20 Counting can be hard Triangle counting (N = # nodes, M = # edges, d_max = max degree): Naïve method: O(N^3) Look at every triple (a, b, c) and check for a Δ Best known methods: O(d_max^2 · N) or O(M^1.5) Not efficient for large graphs Twitter 2009: N = 54,981,152, M = 1,963,263,821, d_max > 3 million

21 Sampling to the rescue Suppose S is the set whose size we want to estimate Let U be some universe such that S is a subset of U

22 Monte Carlo Method For i = 1 to n: Choose y from U uniformly at random Check whether y is in S Let X_i = 1 if y is in S, and 0 otherwise Return: Ĉ = |U| · (1/n) Σ_i X_i Stop when the estimated count converges
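In code the method is a few lines. This sketch takes the universe size, a uniform sampler over U, and a membership test for S as parameters (hypothetical names, for illustration), and uses a fixed number of samples rather than a convergence test.

```python
import random

def monte_carlo_count(universe_size, draw, in_s, n):
    """Estimate |S| as |U| times the fraction of n uniform draws
    from U that land in S."""
    hits = sum(1 for _ in range(n) if in_s(draw()))
    return universe_size * hits / n

# Example: count the multiples of 7 in U = {1, ..., 10**6}.
U = 10**6
est = monte_carlo_count(U, lambda: random.randrange(1, U + 1),
                        lambda y: y % 7 == 0, n=10_000)
print(est)  # close to 142,857
```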

23 When to use the Monte Carlo Method Easy to uniformly sample from U Easy to check whether a sample is in S Estimate Ĉ converges with a small number of samples: P[(1 − ε)|S| < Ĉ < (1 + ε)|S|] > 1 − δ

24 Triangle Counting U = set of all triples, |U| = N^3 # samples needed for convergence: n > (N^3 / #Δ) · (3/ε²) · ln(2/δ) The number of Δs can be much smaller than N^3, so this is no better than the naïve algorithm.
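A sketch of this baseline, sampling unordered triples of distinct nodes so that |U| = C(N, 3) (the slide's N^3 counts ordered triples, an upper bound). `nodes` is assumed to be a list and `edges` a list of node pairs.

```python
import random
from math import comb

def triangles_uniform(nodes, edges, n):
    """Uniform triple sampling: estimate #triangles as
    C(N, 3) * (fraction of sampled triples whose 3 edges all exist)."""
    edge_set = {frozenset(e) for e in edges}
    hits = 0
    for _ in range(n):
        a, b, c = random.sample(nodes, 3)
        if all(frozenset(pair) in edge_set
               for pair in ((a, b), (b, c), (a, c))):
            hits += 1
    return comb(len(nodes), 3) * hits / n
```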

25 Triangle Counting |U| = |T0| + |T1| + |T2| + |T3| T_i: triples that have exactly i edges amongst them S = T3 n > ((|T0| + |T1| + |T2| + |T3|) / |T3|) · (3/ε²) · ln(2/δ) Most triples are in T0. Can we avoid sampling the triples in T0?

26 Biased sampling Sample an edge (a, b) uniformly at random Sample a node c ≠ a, b uniformly at random The universe no longer contains triples from T0 It contains all triples from T1 It contains all triples from T2, and they are twice as likely It contains all triples from T3, and they are three times as likely

27 Triangle counting 2.0 Sample an edge (a, b) uniformly at random Sample a node c ≠ a, b uniformly at random |U| = |T1| + 2|T2| + 3|T3| S = T3 n > ((|T1| + 2|T2| + 3|T3|) / (3|T3|)) · (3/ε²) · ln(2/δ)
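A sketch of the biased estimator: there are M(N − 2) equally likely (edge, third node) pairs, and each triangle is hit by exactly 3 of them (once per edge), so #Δ ≈ (M(N − 2)/3) · (fraction of samples that close a triangle). Same assumptions on `nodes` and `edges` as above.

```python
import random

def triangles_biased(nodes, edges, n):
    """Edge-biased sampling: pick an edge (a, b) uniformly, then a third
    node c uniformly from V \ {a, b}; each triangle corresponds to 3 of
    the M*(N-2) possible (edge, node) pairs."""
    edge_set = {frozenset(e) for e in edges}
    N, M = len(nodes), len(edges)
    hits = 0
    for _ in range(n):
        a, b = random.choice(edges)
        c = random.choice(nodes)
        while c in (a, b):
            c = random.choice(nodes)
        if frozenset((a, c)) in edge_set and frozenset((b, c)) in edge_set:
            hits += 1
    return (M * (N - 2) / 3) * hits / n
```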

28 Summary of Counting Counting exactly can be time consuming Monte Carlo method: pick samples from a universe containing the set of interest, and approximate the count using samples. Simple random sample may need many samples. Biased sampling can help.

29 Outline Simple Random Sampling Means & Proportions Biased Sampling Counting Δs in a graph Graph sampling

30 Problem Given an input graph G = (V, E), construct a subgraph H = (V′, E′), where V′ ⊆ V and E′ ⊆ E We want H to have the same properties as G

31 Properties of interest Density of the graph (# edges / # nodes) Degree distribution Distance distribution Clustering coefficient distribution

32 How to Sample? On edges? On nodes?

33 Edge Sampling For every edge in the original graph, retain it with some probability p.
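This one is a single filter in Python; the retention probability p is a parameter (the slides do not fix a value).

```python
import random

def edge_sample(edges, p):
    """Keep each edge independently with probability p; nodes that lose
    all their edges effectively drop out of the sample."""
    return [e for e in edges if random.random() < p]
```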

34 Edge Sampling The resulting graph is extremely sparse: the diameter is very large and the clustering coefficient is very small Nodes with high degree in the original graph also have high degree in the sampled graph, so this is not a random sample on nodes and the degree distribution is not preserved

35 Node Sampling Choose V′ as a simple random sample of V Induced subgraph: E′ = set of edges (a, b) in E such that both a and b are in V′
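A minimal sketch: an SRS over the node list, then exactly the edges whose endpoints both survive.

```python
import random

def node_sample(nodes, edges, n):
    """SRS of n nodes, then the induced subgraph: keep exactly those
    edges with both endpoints in the sample."""
    v_prime = set(random.sample(nodes, n))
    e_prime = [(a, b) for (a, b) in edges
               if a in v_prime and b in v_prime]
    return v_prime, e_prime
```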

36 Node Sampling The degree distribution looks similar The resulting graph is sparser than the original graph The clustering coefficient is much smaller: nodes sampled uniformly at random are unlikely to include the triples that form triangles

37 Random Walk Sampling Pick a starting node uniformly at random Perform a random walk: With probability 1 − c, choose an outgoing edge uniformly at random With probability c, jump back to the starting node Stop when enough nodes have been visited, and compute the induced subgraph
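A sketch, assuming an adjacency-list dict mapping every node to its neighbor list and a restart probability of c = 0.15 (a typical choice; the slides leave c unspecified). It also assumes the start node's component has at least `size` nodes, otherwise the loop would not terminate.

```python
import random

def random_walk_sample(adj, size, c=0.15):
    """Random walk with restarts: follow a random outgoing edge with
    probability 1 - c, jump back to the start with probability c;
    return the subgraph induced on the visited nodes."""
    start = random.choice(list(adj))
    current, visited = start, {start}
    while len(visited) < size:
        if random.random() < c or not adj[current]:
            current = start                    # restart
        else:
            current = random.choice(adj[current])
        visited.add(current)
    edges = [(a, b) for a in visited for b in adj[a] if b in visited]
    return visited, edges
```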

38 Random Walk Sampling Is also biased toward high-degree nodes, so the degree distribution will not look very similar Preserves the clustering coefficient

39 Random Jump Sampling Pick a starting node uniformly at random Perform a random walk with jumps: With probability 1 − c, choose an outgoing edge uniformly at random With probability c, jump to an arbitrary node This avoids getting stuck in dead ends in directed graphs
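The only change from the previous sketch is the jump target: a uniformly random node rather than the fixed start, so dead ends cannot trap the walk.

```python
import random

def random_jump_sample(adj, size, c=0.15):
    """Random walk with jumps: with probability c, teleport to a
    uniformly random node instead of the fixed start node."""
    nodes = list(adj)
    current = random.choice(nodes)
    visited = {current}
    while len(visited) < size:
        if random.random() < c or not adj[current]:
            current = random.choice(nodes)     # jump anywhere
        else:
            current = random.choice(adj[current])
        visited.add(current)
    edges = [(a, b) for a in visited for b in adj[a] if b in visited]
    return visited, edges
```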

40 Summary of Graph Sampling Graph sampling is not like sampling from tables Hard to preserve all statistics using a single sample. Node sampling preserves the shape of the degree distribution Random walk sampling preserves the clustering in the graph.