Sampling and incomplete network data

Size: px

Start display at page:

Download "Sampling and incomplete network data"

Ernest Murphy
5 years ago
Views:

1 1/58 Sampling and incomplete network data 567 Statistical analysis of social networks Peter Hoff Statistics, University of Washington

2 2/58 Network sampling methods It is sometimes difficult to obtain a complete network dataset: the population nodeset is too large; gathering all relational information is too costly; population nodes are hard to reach. In such cases, we need to think carefully how to gather the data (i.e. design the survey); make inference (i.e. estimate and evaluate parameters).

3 /58 Common sampling methods 1. node-induced subgraph sampling 2. edge-induced subgraph sampling 3. egocentric sampling 4. link tracing designs 5. censored nomination schemes

4 4/58 Node-induced subgraph sampling Procedure: 1. Uniformly sample a set s = {s 1,..., s ns } of nodes s {1,..., n}. 2. Observe relations y s between sampled nodes Y s = {y i,j : i s, j s}.

5 5/58 Node-induced subgraph sampling

6 6/58 Node-induced subgraph sampling

7 7/58 Node-induced subgraph sampling

8 8/58 Node-induced subgraph sampling

9 9/58 Node-induced subgraph sampling

10 10/58 Node-induced subgraph sampling

11 11/58 Estimation from sampled data In what ways does Y s resemble Y? For what functions g() will g(y s) estimate g(y)? Consider the following setup: n n sociomatrix Y n n dyadic covariate X d n 1 nodal covariate X n Can we estimate the following from a sample? ȳ = 1 y n(n 1) i,j, x d = 1 x n(n 1) d,i,j, x n = 1 xn,i n i j yx d = 1 n(n 1) i j y i,j x d,i,j, yx n = i j 1 n(n 1) x n,i ȳ i i

12 12/58 Node-induced subgraph sampling y x d x n yx d yx n

13 13/58 Node-induced subgraph sampling For some functions g, the sample value g(y s) is an unbiased estimator of the population value g(y): g(y) = an average of subgraphs of size k, for k n s g(y) = ( 1 h(y n i,j, y j,i ) 2) i<j g(y) = ( 1 h(y n i,j, y j,i, y i,k, y k,i, y j,k, y k,j ) if n s 3 3) i<j<k Why does it work?: Each subgraph of size k appears in the sample with equal probability (although the subgraphs that appear are dependent). Some functions of interest are not of this type: in and outdegree distributions; geodesics, distances, number of paths, etc.

14 14/58 Edge-induced subgraph sampling Procedure: 1. Uniformly sample a set e = {e 1,..., e ne } of edges e {(i, j) : y i,j = 1} 2. Let Y s be the edge-generated subgraph of e.

15 15/58 Edge-induced subgraph sampling

16 16/58 Edge-induced subgraph sampling

17 17/58 Edge-induced subgraph sampling

18 18/58 Edge-induced subgraph sampling

19 19/58 Edge-induced subgraph sampling

20 Edge-induced subgraph sampling 20/58

21 21/58 Edge-induced subgraph sampling How well do these subgraphs represent Y? Can you infer anything about Y from these data?

22 22/58 Edge-induced subgraph sampling y x d x n yx d yx n

23 23/58 Egocentric sampling Procedure: 1. Uniformly sample a set s 1 = {s 1,1,..., s 1,ns } of nodes s 1 {1,..., n}. 2. Observe the relations for each i s 1, i.e. observe {y i,1,..., y i,n }. 3. Let s 2 be the set of nodes having a link from anyone in s 1. Observe the relations of anyone in s 2 to anyone in s 1 s 2. Y s = {y i,j : i, j s 1 s 2} For large graphs, these data can be obtained (with high probability) by asking each i s 1 the following: 1. Who are your friends? 2. Among your friends, which are friends with each other?

24 24/58 Link-tracing designs Snowball sampling: Iteratively repeat the egocentric sampler, obtaining the stage-k nodes s k from the links of s k 1. This is a type of link-tracing design. The links of the current nodes determine who is next to be included in the sample. How will such subgraphs Y s be similar to Y? How will they differ?

25 25/58 Egocentric sampling y x d x n yx d yx n

26 26/58 Egocentric sampling

27 27/58 Egocentric sampling

28 28/58 Egocentric sampling

29 Egocentric sampling 29/58

30 Egocentric sampling 30/58

31 31/58 Egocentric sampling

32 32/58 Egocentric sampling

33 33/58 Egocentric sampling

34 34/58 Egocentric sampling

35 35/58 Egocentric sampling

36 6/58 Inference with egocentric samples Y s is not generally representative of Y. For some statistics, weighted averages based on Y s can be unbiased (Horwitz-Thompson estimator). For many statistics, part of Y s can be used to obtain good estimates: degree distributions can be estimated from degrees of egos; covariate distributions can be estimated from those of the egos; However, use of data from s 2 generally requires a reweighting scheme.

37 37/58 References Snijders (1992), Estimation on the basis of snowball samples: How to Weight? Kolaczyk (2009) Sampling and Estimation in Network Graphs, chapter 5 of Statistical Analysis of Network Data.

38 38/58 Parameter estimation with incomplete sampled data Model: Pr(Y = y θ), θ Θ. Complete data: Y Observed data: Y[O], where O is a set of pairs of indices i 1 j 1 i 2 j 2 O = i 3 j 3.. How can we make inference for θ based on Y[O]? i s j s

39 9/58 Study design and missing data Node-induced subgraph sampling NA NA NA NA NA NA

40 40/58 Study design and missing data Node-induced subgraph sampling NA NA NA NA NA NA

41 41/58 Study design and missing data Node-induced subgraph sampling NA NA NA NA NA NA

42 42/58 Study design and missing data Node-induced subgraph sampling: Observed data NA NA NA NA NA NA 2 NA NA 1 NA 0 NA 3 NA 0 NA NA 0 NA 4 NA NA NA NA NA NA 5 NA 0 1 NA NA NA 6 NA NA NA NA NA NA

43 3/58 Study design and missing data Edge-induced subgraph sampling NA NA NA NA NA NA

44 44/58 Study design and missing data Edge-induced subgraph sampling NA NA NA NA NA NA

45 45/58 Study design and missing data Edge-induced subgraph sampling: Observed data NA NA NA NA NA 1 2 NA NA 1 NA NA NA 3 1 NA NA NA NA NA 4 NA NA NA NA NA NA 5 NA NA 1 NA NA NA 6 NA NA 1 NA NA NA

46 6/58 Study design and missing data Egocentric sampling NA NA NA NA NA NA

47 47/58 Study design and missing data Egocentric sampling NA NA NA NA NA NA

48 48/58 Study design and missing data Egocentric sampling NA NA NA NA NA NA

49 49/58 Study design and missing data Egocentric sampling NA NA NA NA NA NA

50 50/58 Study design and missing data Egocentric sampling NA NA NA NA NA NA 2 0 NA NA 0 NA NA NA 0 4 NA NA NA NA NA NA 5 NA NA NA NA NA NA 6 NA 0 1 NA NA NA

51 1/58 Parameter estimation with missing data If the data are missing at random, i.e. the value of o, what you get to observe, doesn t depend on θ doesn t depend on values of Y, then valid likelihood and Bayesian inference can be obtained from the observed-data likelihood: l MAR (θ : y[o]) = Pr(Y[o] = y[o] : θ) = y[o c ] Pr(Y = y : θ) Inference based on l(θ : y[o]) is provided in amen: put NA s in place of any non-observed relations.

52 52/58 Missing at random designs Which designs we ve discussed correspond to MAR relations? Node-induced subgraph sampling? Edge-induced subgraph sampling? Egocentric sampling?

53 53/58 Ignorable designs While egocentric and other link-tracing designs are not MAR, they still can be analyzed as if they were. The argument is as follows: The data include O = o, the determination of which relations you get to see; Y[O] = y[o], the relationship values for the observable relations. The likelihood is then l(θ : o, y[o]) = Pr(Y[o] = y[o], O = o θ) = Pr(Y[o] = y[o] θ) Pr(O = o θ, Y[o] = y[o]) = l MAR (θ : y[o]) Pr(O = o θ, Y[o] = y[o]) If the design part doesn t depend on θ, then the observed likelihood is proportional to the MAR likelihood, and the design can be ignored.

54 54/58 Ignorable designs l(θ : o, y[o]) = l MAR (θ : y[o]) Pr(O = o θ, Y[o] = y[o]) When is the design ignorable? (MAR) If the probability that O equals o doesn t depend on θ or Y (e.g., node-induced subgraph sampling), the design is ignorable. ID If the probability that O equals o doesn t depend on θ only depends on Y through Y[o]. then the design is ignorable. The latter conditions are often met for link tracing designs, like egocentric and snowball sampling.

55 55/58 References Thompson and Frank (2000) Model-based estimation with link-tracing sampling designs Heitjan and Basu (1996) Distinguishing Missing at Random and Missing Completely at Random

56 56/58 Simulation study - ID likelihoods y i,j = β 0 + β r x n,i + β cx n,j + β d,i,j + a i + b j + ɛ i,j fit.pop = fitted model based on complete network data fit.samp = fitted model based on sampled network data How do the parameter estimates of fit.samp compare to those of fit.pop?

57 57/58 Node-induced subgraph sample n p = 32, n s = β β r β c β d

58 58/58 Egocentric sample n p = 32, n s1 = β β r β c β d

Appendix: Modeling Approach

AFFECTIVE PRIMACY IN INTRAORGANIZATIONAL TASK NETWORKS Appendix: Modeling Approach There is now a significant and developing literature on Bayesian methods in social network analysis. See, for instance,