Sampling and Estimation in Network Graphs

Size: px

Start display at page:

Download "Sampling and Estimation in Network Graphs"

Lorena Johnston
6 years ago
Views:

1 Sampling and Estimation in Network Graphs Gonzalo Mateos Dept. of ECE and Goergen Institute for Data Science University of Rochester March 23, 2017 Network Science Analytics Sampling and Estimation in Network Graphs 1

2 Network sampling Network sampling and challenges Background on statistical sampling theory Network graph sampling designs Estimation of network totals and group size Estimation of degree distributions Network Science Analytics Sampling and Estimation in Network Graphs 2

3 Sampling network graphs Measurements often gathered only from a portion of a complex system Ex: social study of high-school class vs. large corporation, Internet Network graph sample from a larger underlying network Goal: use sampled network data to infer properties of the whole system Approach using principles of statistical sampling theory Sampling in network contexts introduces various potential challenges System under study G(V, E) Population graph Random Procedure Available measurements G (V, E ) Sampled graph G often a subgraph of G (i.e., V V, E E), but may not be Network Science Analytics Sampling and Estimation in Network Graphs 3

4 The fundamental problem Suppose a given graph characteristic or summary η(g) is of interest Ex: order Nv, size N e, degree d v, clustering coefficient cl(g),... Typically impossible to recover η(g) exactly from G Q: Can we still form a useful estimate ˆη = ˆη(G ) of η(g)? Plug-in estimator ˆη := η(g ) Boils down to computing the characteristic of interest in G Many familiar estimators in statistical practice are of this type Ex: sample means, standard deviations, covariances, quantiles... Oftentimes η(g ) is a poor representation of η(g) Network Science Analytics Sampling and Estimation in Network Graphs 4

5 Example: Estimating average degreee Let G(V, E) be a network of protein interactions in yeast Characteristic of interest is average degree η(g) = 1 N v i V Here N v = 5, 151, N e = 31, 201 η(g) = Consider two sampling designs to obtain G First sample n vertices V = {i 1,..., i n } without replacement Design 1: For each i V, observe incident edges (i, j) E Design 2: Observe edge (i, j) only when both i, j V Estimate η(g) by averaging the observed degree sequence {d i } i V η(g ) = 1 n d i di i V Network Science Analytics Sampling and Estimation in Network Graphs 5

6 Example: Estimating average degreee (cont.) Random sample of n = 1, 500 vertices, Designs 1 and 2 for edges Process repeated for 10,000 trials histogram of η(g ) Density Design 2 Design Estimate of of average Average Degree degree Under-estimate η(g) for Design 2, but Design 1 on target. Why? Design 1: sample vertex degree explicitly, i.e., d i = d i Design 2: (implicitly) sample vertex degree with bias, i.e., d i n N v d i Network Science Analytics Sampling and Estimation in Network Graphs 6

7 Improving estimation accuracy In order to do better we need to incorporate the effects of Random sampling; and/or Measurement error Sampling design, topology of G, nature of η( ) all critical Model-based inference Likelihood-based and Bayesian paradigms Design-based methods Statistical sampling theory Assume observations made without measurement error Only source of randomness sampling procedure Ex: Estimating average degree Under Design 2 the estimate is biased, with mean of only Adjusting η(g ) upward by a factor Nv = yields 12,115 n Will see how statistical sampling theory justifies this correction Network Science Analytics Sampling and Estimation in Network Graphs 7

8 Background Network sampling and challenges Background on statistical sampling theory Network graph sampling designs Estimation of network totals and group size Estimation of degree distributions Network Science Analytics Sampling and Estimation in Network Graphs 8

9 Statistical sampling theory Suppose we have a population U = {1,..., N u } of N u units Ex: People, animals, objects, vertices,... A value y i is associated with each unit i U Ex: Height, age, gender, infected, membership,... Typical interest in the population totals τ and averages µ τ := y i and µ := 1 y i = 1 τ N u N u i U i U Basic sampling theory paradigm oriented around these steps: S1: Randomly sample n units S = {i 1,..., i n } from U S2: Observe the value y ik for k = 1,..., n S3: Form an unbiased estimator ˆµ of µ, i.e., E [ˆµ] = µ S4: Evaluate or estimate the variance var [ˆµ] Network Science Analytics Sampling and Estimation in Network Graphs 9

10 Inclusion probabilities Def: For given sampling design, the inclusion probability π i of unit i is π i := P (unit i belongs in the sample S) Simple random sampling (SRS): n units sampled uniformly form U Without replacement: i 1 chosen from U, i 2 from U \ {i 1 }, and so on There are ( N u ) n such possible samples of size n There are ( N u 1) n 1 samples which include a given unit i The inclusion probability is π i = ( Nu 1 n 1 ( Nu n ) ) = n N u Network Science Analytics Sampling and Estimation in Network Graphs 10

11 Sample mean estimator Definition of sample mean estimator ˆµ = 1 n Using indicator RVs I {i S} for i U, where E [I {i S}] = π i [ ] [ ] 1 1 N u E [ˆµ] = E y i = E y i I {i S} n n = 1 n N u i=1 i S i S y i i=1 y i E [I {i S}] = 1 n N u i=1 y i π i SRS without replacement unbiased because π i = n N u Unequal probability sampling More common than SRS, especially with networks. (More soon) Sample mean can be a poor (i.e., biased) estimator for µ Network Science Analytics Sampling and Estimation in Network Graphs 11

12 Horvitz-Thompson estimation for totals Idea: weighted average using inclusion probabilities as weights Horvitz-Thompson (HT) estimator ˆµ π = 1 N u i S y i π i and ˆτ π = N u ˆµ π Remedies the bias problem E [ˆµ π ] = 1 N u N u i=1 y i π i E [I {i S}] = 1 N u Size of the population N u assumed known N u i=1 y i = µ Broad applicability, but π i may be difficult to compute Network Science Analytics Sampling and Estimation in Network Graphs 12

13 Horvitz-Thompson estimator variance Def: Joint inclusion probability π ij of units i and j is π ij := P (units i and j belong in the sample S) If inclusion of units i and j are independent events π ij = π i π j Ex: Simple random sampling without replacement yields π ij = Variance of the HT estimator: var [ˆτ π ] = i U j U n(n 1) N u (N u 1) y i y j ( πij π i π j 1 ), var [ˆµ π ] = var [ˆτ π] N 2 u Typically estimated in an unbiased fashion from the sample S Network Science Analytics Sampling and Estimation in Network Graphs 13

14 Probability proportional to size sampling Unequal probability sampling n units selected w.r.t. a distribution {p 1,..., p Nu } on U Uniform sampling: special case with p i = 1 N u Probability proportional to size (PPS) sampling for all i U Probabilities p i proportional to a characteristic c i Ex: households chosen by drawing names from a database If sampling with replacement, PPS inclusion probabilities are π i = 1 (1 p i ) n, where p i = c i k c k Joint inclusion probabilities for variance calculations π ij = π i + π j [1 (1 p i p j ) n ] Network Science Analytics Sampling and Estimation in Network Graphs 14

15 Estimation of group size So far implicitly assumed N u known Often not the case! Ex: endangered animal species, people at risk of rare disease Special population total often of interest is the group size N u = 1 i U Suggests the following HT estimator of N u ˆN u = i S π 1 i Infeasible, since knowledge of N u needed to compute π i Network Science Analytics Sampling and Estimation in Network Graphs 15

16 Capture-recapture estimator Capture-recapture estimators overcome HT limitations in this setting Two rounds of SRS without replacement Two samples S 1, S 2 Round 1: Mark all units in sample S 1 of size n 1 from U Ex: tagging a fish, noting the ID number... All units in S1 are returned to the population Round 2: Obtain a sample S 2 of size n 2 from U Capture-recapture estimator of N u ˆN u := n 2 m n 1, where m := S 1 S 2 Factor m/n 2 indicative of marked fraction of the overall population Can derive using model-based arguments as an ML estimator Network Science Analytics Sampling and Estimation in Network Graphs 16

17 Common network graph sampling designs Network sampling and challenges Background on statistical sampling theory Network graph sampling designs Estimation of network totals and group size Estimation of degree distributions Network Science Analytics Sampling and Estimation in Network Graphs 17

18 Graph sampling designs Q: What are common designs for sampling a network graph G? A: Will see a few examples, along with their inclusion probabilities π i Graph-based sampling designs Two inter-related classes of units, vertices i and edges (i, j) Often two stages Selection among one class of units (e.g., vertices) Observation of units from the other class (e.g., edges) Inclusion probabilities offer insight into the nature of the designs Central to HT estimators of network graph characteristics η(g) Network Science Analytics Sampling and Estimation in Network Graphs 18

19 Induced subgraph sampling S) Sample n vertices V = {i 1,..., i n } without replacement (SRS) O) Observe edges (i, j) E only when both i, j V (induced by V ) Ex: construction of contact networks in social network research Vertex and edge inclusion probabilities are uniformly equal to π i = n N v and π {i,j} = n(n 1) N v (N v 1) Network Science Analytics Sampling and Estimation in Network Graphs 19

20 Incident subgraph sampling Consider a complementary design to induced subgraph sampling S) Sample n edges E without replacement (SRS) O) Observe vertices i V incident to those selected edges in E Ex: construction of sampled telephone call graphs Network Science Analytics Sampling and Estimation in Network Graphs 20

21 Inclusion probabilities For incident subgraph sampling, edge inclusion probabilities are π {i,j} = n N e Vertex in V if any one or more of its incident edges are sampled π i = P (vertex i is sampled) = 1 P (no edge incident to i is sampled) { ( Ne di ) n 1 ( = Ne ), if n N e d i n 1, if n > N e d i Vertices included with unequal probs. that depend on their degrees Probability proportional to size (degree) sampling of vertices Requires knowledge of N e and degree sequence {d i } i V Network Science Analytics Sampling and Estimation in Network Graphs 21

22 Snowball sampling S) Sample n vertices V 0 = {i 1,..., i n } without replacement (SRS) O1) Observe edges E0 incident to each i V 0, forming the initial wave O2) Observe neighbors N (V 0 ) of i V 0, i.e., V 1 = N (V 0 ) (V 0 )c Iterate to a desired number of e.g., k waves, or until Vk empty G has V = V0 V 1... V k, and their incident edges Ex: spiders or crawlers to discover the WWW s structure Network Science Analytics Sampling and Estimation in Network Graphs 22

23 Star sampling Difficult to compute inclusion probabilities beyond a single wave Single-wave snowball sampling reduces to star sampling Unlabeled: V = V0 and E = E0 their incident edges Ex: Count all co-authors of n sampled authors Vertex inclusion probabilities are simply πi = n/n v Labeled: V = V0 (N (V 0 ) (V 0 )c ) and E = E0 Ex: Count and identify all co-authors of n sampled authors Vertex inclusion probabilities can be shown to look like π i = L N i ( 1) L +1 P (L), where P (L) = ( Nv L n L ( Nv ) n ) Denoted by Ni the neighborhood of vertex i (including i itself) Network Science Analytics Sampling and Estimation in Network Graphs 23

24 Link tracing Link-tracing designs Select an initial sample of vertices VS Trace edges (links) from Vs to another set of vertices VT Snowball sampling: special case where all incident edges are traced May be infeasible to follow all incident edges to a given vertex Ex: lack of recollection/deception in social contact networks Path sampling designs Source nodes V S = {s 1,..., s ns } V Target nodes V T = {t 1,..., t nt } V \ V S Traverse and measure the path between each pair (s i, t j ) Ex: Traceroute Internet studies, Milgram s Six Degrees experiment Network Science Analytics Sampling and Estimation in Network Graphs 24

25 Traceroute sampling Trace shortest paths from each source to all targets s1 s2 t1 t2 Vertex and edge inclusion probabilities roughly [Dall Asta et al 06]: π i 1 (1 ρ S ρ T )e ρ S ρ T c Be (i) and π {i,j} 1 e ρ S ρ T c Be ({i,j}) Source and target sampling fractions ρ S := n S /N v and ρ T := n T /N v Induces PPS sampling, size given by betweenness centralities Network Science Analytics Sampling and Estimation in Network Graphs 25

26 Estimation of totals in network graphs Network sampling and challenges Background on statistical sampling theory Network graph sampling designs Estimation of network totals and group size Estimation of degree distributions Network Science Analytics Sampling and Estimation in Network Graphs 26

27 Network summaries as totals Various graph summaries η(g) are expressible in terms of totals τ Average degree: Let U = V and y i = d i, then η(g) = d i V d i Graph size: Let U = E and y ij = 1, then η(g) = N e = (i,j) E 1 Betweenness centrality: Let U = V (2) (unordered vertex pairs) and y ij = I { k P (i,j) }. For unique shortest i j paths P(i,j), then η(g) = c Be (k) = (i,j) V (2) I { k P(i,j) } Clustering coefficient: Let U = V (3) (unordered vertex triples), then η(g) = cl(g) = 3 total number of triangles total number of connected triples Often such totals can be obtained from sampled G via HT estimation Network Science Analytics Sampling and Estimation in Network Graphs 27

28 Vertex totals Vertex totals are of the form τ = i V y i, averages are τ/n v Ex: average degree where yi = d i Ex: nodes with characteristic C, where yi = I {i C} Given a sample V V, the HT estimator for vertex totals is ˆτ π = y i π i i V Variance expressions carry over, let U = V and V for estimates Inclusion probabilities π i depend on network sampling design Sampling also influences whether y i is observable, e.g., y i = d i Network Science Analytics Sampling and Estimation in Network Graphs 28

29 Totals on vertex pairs Quantity y ij corresponding to vertex pairs (i, j) V (2) of interest Totals τ = (i,j) V (2) y ij become relevant Ex: graph size Ne and betweenness c Be (k) where y ij = I { k P (i,j) } Ex: shared gender in friendship network, average dissimilarity The HT estimator in this context is ˆτ π = y ij π ij (i,j) V (2) Edge totals a special case, when y ij 0 only for (i, j) E Variance expression increasingly complicated, namely var [ˆτ π ] = ( ) πijkl y ik y kl 1 π ij π kl (i,j) V (2) (k,l) V (2) Depends on inclusion probabilities π ijkl of vertex quadruples Network Science Analytics Sampling and Estimation in Network Graphs 29

30 Example: Estimating network size Consider estimating N e as an edge total, i.e., N e = 1 = (i,j) E (i,j) V (2) A ij Bernoulli sampling (BS): I {i V } Ber(p) i.i.d. for all i V Edges E obtained via induced subgraph sampling π ij = p 2 The HT estimator of N e is ˆN e = (i,j) V (2) A ij π ij = p 2 N e Scales up the empirically observed edge total N e by p 2 > 1 Variance can be shown to take the form [Frank 77] [ ] var ˆN e = (p 1 1) di 2 + (p 2 2p 1 + 1)N e i V Network Science Analytics Sampling and Estimation in Network Graphs 30

31 Example: Estimating network size (cont.) Protein network: N v = 5, 151, N e = 31, 201 BS of vertices with p = 0.1 and p = 0.3 Process repeated for 10,000 trials histogram of ˆN e Density 0e+00 6e Density 0e+00 3e N^ e, for p=0.10 se^ (N^ e), for p=0.10 Density Density N^ e, for p=0.30 se^ (N^ e), for p=0.30 Average of ˆN e was 31, 116 and 31, 203 Unbiasedness supported Mean and variability of ŝe shrinks with p (larger sample) Network Science Analytics Sampling and Estimation in Network Graphs 31

32 Example: Estimating clustering coefficient Average clustering coefficient cl(g) can be expressed as cl(g) = 3 τ (G) τ 3 (G) Involves the quotient of two totals on vertex triples τ = (i,j,k) V (3) y ijk ˆτ π = Total number of triangles τ (G), where (i,j,k) V (3) y ijk π ijk y ijk = A ij A jk A ki Total number of connected triples τ 3 (G), where y ijk = A ij A jk + A ij A ki + A jk A ki Network Science Analytics Sampling and Estimation in Network Graphs 32

33 Example: Estimating clustering coefficient (cont.) Protein network: τ (G) = 44, 858, τ 3 (G) 1M, and cl(g) = BS of vertices with p = 0.2 Induced subgraph sampling of edges Process repeated for 10,000 trials histogram of ĉl(g) Density Estimated Clustering Coefficient Unbiased HT estimators ˆτ = p 3 τ (G ) and ˆτ 3 = p 3 τ 3 (G ) Plug-in estimator ĉl(g) = 3ˆτ /ˆτ 3 results in ĉl(g) = cl(g ) Quite accurate with mean and ŝe of Network Science Analytics Sampling and Estimation in Network Graphs 33

34 Caveat emptor Horvitz-Thompson framework fairly straightforward in its essence Success in network sampling and estimation rests on interaction among a) Sampling design; b) Measurements taken; and c) Total to be estimated Three basic elements must be present in the problem 1) Network summary statistic η(g) expressible as total; 2) Values y either observed, or obtainable from measurements; and 3) Inclusion probabilities π computable for the sampling design Unfortunately, often not all three are present at the same time... Network Science Analytics Sampling and Estimation in Network Graphs 34

35 Example: Estimating average degreee Recall our first example on estimation of average degree 1 N v i V d i Design 1: Unlabeled star sampling, observes degrees di, i V Design 2: Induced subgraph sampling, does not observe degrees Average degree is a scaling of a vertex total (N v known) HT estimation applicable so long as y i = d i observed True for unlabeled star sampling, and since π i = n/n v we have ˆµ St = ˆτ St N v, where ˆτ St = i V St d i n/n v We do not observe d i under induced subgraph sampling Not amenable to HT estimation as vertex total for this design Network Science Analytics Sampling and Estimation in Network Graphs 35

36 Example: Estimating average degreee (cont.) Identity µ = 2Ne N v For induced subgraph sampling π ij = ˆN e,is = Tackle instead as estimation of network size N e n(n 1) N v (N v 1), so HT estimator is A ij n(n 1)/[N v (N v 1)] = N v (N v 1) n(n 1) N e,is (i,j) V (2) Desired unbiased estimator for the average degree is ˆµ IS = 2 ˆN e,is N v Estimators under both designs can be compared by writing them as ˆµ St = 2N e,st n and ˆµ IS = 2N e,is. N v 1 n n 1 Design 1: uses the identity µ = 2Ne Design 2: same but inflated by N v on G St Nv 1 n 1, compensates d i,is < d i Network Science Analytics Sampling and Estimation in Network Graphs 36

37 Estimation of network group size Assuming that N v is known may not be on safe grounds Human or animal groups too mobile or elusive to count accurately All Web pages or Internet routers are too massive and dispersed Often estimating N v may well be the prime objective If vertex SRS or BS feasible, could sample twice marking in between Facilitates usage of capture-recapture estimators off-the-shelf If sampling infeasible, or capture-recapture performs poorly Develop estimators of N v tailored to the graph sampling at hand Network Science Analytics Sampling and Estimation in Network Graphs 37

38 Estimating the size of a hidden population Hidden population: individuals do not wish to expose themselves Ex: humans of socially sensitive status, such as homeless Ex: involved in socially sensitive activities, e.g., drugs, prostitution Such groups are often small Estimating their size is challenging Snowball sampling used to estimate the size of hidden populations O. Frank and T. Snijders, Estimating the size of hidden populations using snowball sampling, J. Official Stats., vol. 10, pp , 1994 Network Science Analytics Sampling and Estimation in Network Graphs 38

39 Sampling a hidden population Directed graph G(V, E), V the members of the hidden population Graph describing willingness to identify other members Arc (i, j) when ask individual i, mentions j as a member Graph G obtained via one-wave snowball sampling, i.e., V = V0 V 1 Initial sample V0 obtained via BS from V with probability p 0 Consider the following random variables (RVs) of interest N = V 0 : size of the initial sample M1: number of arcs among individuals in V0 M2: number of arcs from individuals in V0 to individuals in V1 Snowball sampling yields measurements n, m 1, and m 2 of these RVs Network Science Analytics Sampling and Estimation in Network Graphs 39

40 Method of moments estimator Method of moments: equate moments to sample counterparts [ ] E [N] = E I {i V0 } = N v p 0 = n E [M 1 ] = E j E [M 2 ] = E j i I {i V0 }I {j V0 }A ij = N e p0= 2 m 1 i j I {i V0 }I {j / V0 }A ij = N e p 0 (1 p 0 )= m 2 i j Expectation w.r.t. randomness in selecting the sample V 0. Solution: ( ) m1 + m 2 ˆN v = n m 1 Size of initial sample inflated by estimate of the sampling rate Network Science Analytics Sampling and Estimation in Network Graphs 40

41 Estimation of degree distributions Network sampling and challenges Background on statistical sampling theory Network graph sampling designs Estimation of network totals and group size Estimation of degree distributions Network Science Analytics Sampling and Estimation in Network Graphs 41

42 Estimation of other network characteristics Classical sampling theory rests heavily on Horvitz-Thompson theory Scope limited to network totals Q: Other network summaries, e.g., degree distributions? Findings on the effect of sampling on observed degree distributions: Highly unrepresentative of actual degree distributions; and Unhelpful to characterizing heterogeneous distributions Ex: Internet traceroute sampling [Lakhina et al 03] Broad degree distribution in G, while concentrated in G Ex: Sampling protein-protein interaction networks [Han et al 05] Power-law exponent estimate from G underestimates α in G Network Science Analytics Sampling and Estimation in Network Graphs 42

43 Impact of sampling on degree distribution Let N(d) denote the number of vertices with degree d in G Let N (d) be the counterpart in a sampled graph G Introduce vectors n = [N(0),..., N(d max )] and likewise n Under a variety of sampling designs, it holds that E [n ] = Pn Matrix P depends fully on the sampling, not G itself Expectation w.r.t. randomness in selecting the sample G O. Frank, Estimation of the number of vertices of different degrees in a graph, J. Stat. Planning and Inference, vol. 4, pp , 1980 Network Science Analytics Sampling and Estimation in Network Graphs 43

44 An inverse problem Recall the identity E [n ] = Pn Face a linear inverse problem Unbiased estimator of the degree distribution n ˆn naive = P 1 n While natural, two problems with this simple solution Matrix P typically not invertible in practice; and Non-negativity of the solution is not guaranteed We actually have an ill-posed linear inverse problem Network Science Analytics Sampling and Estimation in Network Graphs 44

45 Performance of naive estimator Erdös-Renyi graph with N v = 100 and N e = 500 BS of vertices with p = 0.6 Induced subgraph sampling of edges Network Science Analytics Sampling and Estimation in Network Graphs 45

46 Penalized least-squares formulation Constrained, penalized, weighted least-squares [Zhang et al 14] min (Pn n ) C 1 (Pn n ) + λpen(n) n s. to N(d) 0, d = 0, 1,..., d max, d max N(d) = N v d=1 Matrix C denotes the covariance of n Functional pen(n) penalizes complexity in n, tuned by λ Constraints Non-negativity of degree counts Total degree counts equal the number of vertices Smoothness: pen(n) = Dn 2, D differentiating operator Network Science Analytics Sampling and Estimation in Network Graphs 46

47 Application to online social networks Communities from online social networks Orkut and LiveJournal BS of vertices with p = 0.3 Induced subgraph sampling of edges True, sampled, and estimated degree distribution Network Science Analytics Sampling and Estimation in Network Graphs 47

48 Glossary Enumeration and samping Population graph Sampled graph Plug-in estimator Sampling design Sample with(out) replacement Design-based methods Averages and totals Inclusion probability Simple random sampling Bernoulli sampling Unequal probability sampling Horvitz-Thompson estimator Probability proportional to size sampling Capture-recapture estimator Induced subgraph sampling Incident subgraph sampling Snowball and star sampling Traceroute sampling Hidden population Ill-posed inverse problem Penalized least squares Network Science Analytics Sampling and Estimation in Network Graphs 48

Estimating network degree distributions from sampled networks: An inverse problem

Estimating network degree distributions from sampled networks: An inverse problem Eric D. Kolaczyk Dept of Mathematics and Statistics, Boston University kolaczyk@bu.edu Introduction: Networks and Degree