Network Degree Distribution Inference Under Sampling

1 Network Degree Distribution Inference Under Sampling
Aleksandrina Goeva (1), Eric Kolaczyk (1), Rich Lehoucq (2)
(1) Department of Mathematics and Statistics, Boston University
(2) Sandia National Labs
August 18, 2016, BU/Keio, Boston, MA

2 Motivation
Sampling introduces randomness in the sampled network.
Sampled network characteristics may not represent those of the true network well.

3 Sampling Mechanisms - Examples
[Figure not recovered. Source: Kolaczyk (2009).]

4 Setup
Problem introduced by Frank (1968):
$\mathbb{E}[N^*] = P N$
where
$N = (N_0, N_1, \ldots, N_M)$ is the degree counts vector of the true network $G$,
$N^* = (N^*_0, N^*_1, \ldots, N^*_M)$ is the degree counts vector of the sampled network,
$P$ is a linear operator that depends fully on the sampling scheme and not on the network itself, and
$M$ is the maximum degree in the true network $G$.
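As one concrete instance of P, the sketch below assumes induced-subgraph sampling in which every node is kept independently with probability p; the slide itself does not fix a design, so this choice is an assumption. Under it, a node of true degree j is kept with probability p and retains a Binomial(j, p) number of its neighbours, which gives P(i, j) = C(j, i) p^(i+1) (1-p)^(j-i) for i <= j. The function names and the example counts are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """Build P (list of rows) for Bernoulli(p) induced-subgraph sampling (assumed design)."""
    size = max_degree + 1
    P = [[0.0] * size for _ in range(size)]
    for j in range(size):            # j = true degree
        for i in range(j + 1):       # i = observed degree, at most j
            P[i][j] = comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i)
    return P

def apply_operator(P, counts):
    """Compute the matrix-vector product P @ counts."""
    return [sum(row[j] * counts[j] for j in range(len(counts))) for row in P]

# Hypothetical true degree counts N_0, ..., N_4.
true_counts = [10, 30, 40, 15, 5]
P = induced_subgraph_operator(max_degree=4, p=0.5)
expected_sampled = apply_operator(P, true_counts)   # E[N*] = P N
```

Each column of this P sums to p, so the expected number of sampled nodes is p times the number of true nodes, and with p = 1 the operator reduces to the identity.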

5 Naive Solution
$\hat{N}_{\text{naive}} = P^{-1} N^*$
Issues:
P is typically non-invertible.
Solutions may not be non-negative.
Source: Zhang, Kolaczyk, Spencer (2015)
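The failure of the naive solution can be illustrated numerically. The sketch below again assumes Bernoulli(p) induced-subgraph sampling, for which P is upper triangular and formally invertible, but its smallest diagonal entry is p^(M+1), so the inverse amplifies any noise enormously. All names and numbers are hypothetical.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    """P(i, j) = C(j, i) p^(i+1) (1-p)^(j-i) under the assumed sampling design."""
    size = max_degree + 1
    return [[comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i) if i <= j else 0.0
             for j in range(size)] for i in range(size)]

def solve_upper_triangular(U, b):
    """Solve U x = b by back-substitution (U upper triangular, nonzero diagonal)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(U[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - tail) / U[i][i]
    return x

p, M = 0.3, 10
true_counts = [5, 20, 40, 60, 50, 30, 15, 8, 4, 2, 1]   # hypothetical N
P = induced_subgraph_operator(M, p)
expected = [sum(P[i][j] * true_counts[j] for j in range(M + 1)) for i in range(M + 1)]

# Perturb the data by a single extra node observed at the maximum degree.
observed = list(expected)
observed[M] += 1.0

naive = solve_upper_triangular(P, observed)
print(min(naive))   # a large negative "count"
```

A perturbation of one node in the observed counts is divided by p^(M+1) and propagates upward with alternating signs, which is why the naive estimate contains negative entries.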

6 Problem Formulation
$N^* = P N + E$, with $N^*$ the degree counts vector of the sampled network.
Ill-posed linear inverse problem.
P is not random; it depends only on the sampling design.
E is the noise due to sampling: $\mathbb{E}[E] = 0$, $\mathbb{E}[E E^T] = C$.

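7 Proposed Approach
Complexity functional:
$K(\tilde{N}, \cdot) = (P\tilde{N} - \cdot)^T C^{-1} (P\tilde{N} - \cdot) + \lambda \|D\tilde{N}\|_2^2$
where
$\lambda$ is a regularization parameter,
$D$ is a second-order differencing operator,
$C = \mathrm{Cov}(N^*) = \mathbb{E}[E E^T]$.
Look for a constrained solution $\tilde{N} \in \mathcal{C} := \{\tilde{N} : \tilde{N} \ge 0 \text{ and } 1^T \tilde{N} = n_v\}$.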
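A minimal sketch of the two ingredients of the complexity functional, the second-order differencing operator D and the weighted fit term, might look as follows. It assumes a diagonal covariance C purely to keep the example short (the slide does not say C is diagonal), and every function name is hypothetical.

```python
def second_difference(v):
    """Apply D: (D v)_k = v_k - 2 v_{k+1} + v_{k+2}."""
    return [v[k] - 2 * v[k + 1] + v[k + 2] for k in range(len(v) - 2)]

def complexity(candidate, target, P, c_diag, lam):
    """K(candidate, target) = weighted fit + lam * ||D candidate||^2.

    c_diag holds the diagonal of C (assumption: C is diagonal in this sketch).
    """
    fitted = [sum(row[j] * candidate[j] for j in range(len(candidate))) for row in P]
    fit = sum((f - t) ** 2 / c for f, t, c in zip(fitted, target, c_diag))
    penalty = sum(d * d for d in second_difference(candidate))
    return fit + lam * penalty

identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
ones = [1.0] * 5

smooth = [0.0, 1.0, 2.0, 3.0, 4.0]   # linear sequence: second differences vanish
spiky = [0.0, 0.0, 1.0, 0.0, 0.0]    # a single spike is penalized

k_smooth = complexity(smooth, smooth, identity, ones, lam=2.0)
k_spiky = complexity(spiky, spiky, identity, ones, lam=2.0)
```

The penalty is zero for any degree counts vector that is linear in the degree and grows with local roughness, which is how the functional favours smooth solutions.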

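8 C-constrained Minimum Empirical Complexity Estimate
Constrained penalized weighted least squares:
$\min_{\tilde{N}} \; (P\tilde{N} - N^*)^T C^{-1} (P\tilde{N} - N^*) + \lambda \|D\tilde{N}\|_2^2 \quad \text{subject to } \tilde{N} \in \mathcal{C}$
C-constrained minimum empirical complexity estimate:
$\hat{N} = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, N^*)$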
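One way to compute such a constrained estimate is projected gradient descent with backtracking; this is a swapped-in generic solver, not necessarily the one used by the authors. The sketch assumes Bernoulli(p) induced-subgraph sampling for P, a diagonal weight vector standing in for C^{-1}, and noise-free data, all of which are illustrative assumptions with hypothetical names.

```python
from math import comb

def induced_subgraph_operator(max_degree, p):
    size = max_degree + 1
    return [[comb(j, i) * p ** (i + 1) * (1 - p) ** (j - i) if i <= j else 0.0
             for j in range(size)] for i in range(size)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def second_difference(x):
    return [x[k] - 2 * x[k + 1] + x[k + 2] for k in range(len(x) - 2)]

def objective(x, P, y, w, lam):
    r = [f - t for f, t in zip(matvec(P, x), y)]
    return (sum(wi * ri * ri for wi, ri in zip(w, r))
            + lam * sum(d * d for d in second_difference(x)))

def gradient(x, P, y, w, lam):
    r = [f - t for f, t in zip(matvec(P, x), y)]
    g = [2 * sum(P[i][j] * w[i] * r[i] for i in range(len(y))) for j in range(len(x))]
    for k, d in enumerate(second_difference(x)):   # adds 2 * lam * D^T D x
        g[k] += 2 * lam * d
        g[k + 1] -= 4 * lam * d
        g[k + 2] += 2 * lam * d
    return g

def project_to_scaled_simplex(v, total):
    """Euclidean projection onto {x : x >= 0, sum(x) = total}."""
    cumulative, theta = 0.0, 0.0
    for k, value in enumerate(sorted(v, reverse=True), start=1):
        cumulative += value
        shift = (cumulative - total) / k
        if value - shift > 0:
            theta = shift
    return [max(x - theta, 0.0) for x in v]

def fit_degree_counts(P, y, w, lam, n_v, iterations=300):
    x = [n_v / len(P[0])] * len(P[0])              # feasible uniform start
    step = 1.0
    for _ in range(iterations):
        fx, g = objective(x, P, y, w, lam), gradient(x, P, y, w, lam)
        for _attempt in range(60):                 # backtracking on the step size
            cand = project_to_scaled_simplex(
                [xi - step * gi for xi, gi in zip(x, g)], n_v)
            d = [c - xi for c, xi in zip(cand, x)]
            model = (fx + sum(gi * di for gi, di in zip(g, d))
                     + sum(di * di for di in d) / (2 * step))
            if objective(cand, P, y, w, lam) <= model + 1e-9 * (1 + abs(fx)):
                break
            step *= 0.5
        x = cand
    return x

p, lam = 0.5, 0.1
true_counts = [5.0, 20.0, 35.0, 25.0, 10.0, 4.0, 1.0]   # hypothetical N
P = induced_subgraph_operator(len(true_counts) - 1, p)
observed = matvec(P, true_counts)                  # noise-free stand-in for N*
weights = [1.0 / max(v, 1.0) for v in observed]    # hypothetical diagonal C^{-1}
n_v = sum(true_counts)
estimate = fit_degree_counts(P, observed, weights, lam, n_v)
```

The projection step enforces both constraints of the feasible set at every iteration, so the returned vector is non-negative and sums to n_v by construction.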

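9 Quality of the Solution
We aim to upper-bound the risk:
$\mathbb{E}\big[\|P\hat{N} - PN\|^2_{C^{-1}}\big] = \mathbb{E}\big[(P\hat{N} - PN)^T C^{-1} (P\hat{N} - PN)\big]$
$\le \mathbb{E}\big[K(\hat{N}, PN)\big]$
$\le K(N_0, PN) + 2\,\mathbb{E}\big[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN_0)\rangle\big]$
where $N_0 = \arg\min_{\tilde{N} \in \mathcal{C}} K(\tilde{N}, PN)$ is the C-constrained minimizer of theoretical complexity.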
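The second inequality can be recovered in a few lines from the optimality of the estimate; the derivation below is a sketch that uses only the definitions on the preceding slides, with $\mathrm{pen}(\tilde{N}) = \lambda\|D\tilde{N}\|_2^2$ introduced here as shorthand.

```latex
\begin{align*}
% Expand the empirical complexity using N^* = PN + E:
K(\tilde{N}, N^*)
  &= (P\tilde{N} - PN - E)^T C^{-1} (P\tilde{N} - PN - E) + \mathrm{pen}(\tilde{N}) \\
  &= K(\tilde{N}, PN) - 2\, E^T C^{-1} (P\tilde{N} - PN) + E^T C^{-1} E . \\
% \hat{N} minimizes K(\cdot, N^*) over \mathcal{C}, and N_0 \in \mathcal{C}:
K(\hat{N}, N^*) &\le K(N_0, N^*) \\
\Longrightarrow\quad
K(\hat{N}, PN)
  &\le K(N_0, PN) + 2\, E^T C^{-1} (P\hat{N} - PN_0) \\
  &= K(N_0, PN) + 2\, \langle C^{-1/2} E,\; C^{-1/2} (P\hat{N} - PN_0) \rangle .
\end{align*}
```

Taking expectations gives the stated bound, and the first inequality holds because the penalty term in $K(\hat{N}, PN)$ is non-negative.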

10 First Term
$K(N_0, PN)$
This is the minimum theoretical complexity.
This term is not random.
It is bounded in terms of a functional of the sampling design.

11 Different Regimes
Underlying all sampling mechanisms there is a fundamental quantity p controlling the rate of sampling.
The problem behaves differently depending on the value of p. We identify three regimes:
p = 1: full information; the trivial case, no noise, P is diagonal.
small p: the distribution of E is approximately Poisson.
moderate p: the distribution of E is approximately Normal.

12 Different Regimes - Small p
Small p (10% to 20%): the distribution of E is approximately Poisson.
[Figure: Q-Q plot of sample quantiles against Poisson theoretical quantiles.]

13 Different Regimes - Moderate p
Moderate p (30% to 60%): the distribution of E is approximately Normal.
[Figure: Normal Q-Q plot of sample quantiles against theoretical quantiles.]

14 Second Term
$\mathbb{E}\big[\langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN_0)\rangle\big] \le \mathbb{E}\Big[\sup_{P\hat{N} - PN_0 \,\in\, \text{set}} \langle C^{-1/2}E,\; C^{-1/2}(P\hat{N} - PN_0)\rangle\Big]$
Under the moderate p regime, the distribution of E is reasonably close to Gaussian.
Assuming the entries of $C^{-1/2}E$ are independent standard Gaussian, we can bound this term using Gaussian widths.
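For readers unfamiliar with the term, the Gaussian width of a set T is E[sup over t in T of <g, t>] with g a standard Gaussian vector. The sketch below estimates it by Monte Carlo for a toy set, the signed coordinate vectors, which is chosen only for illustration and is not the set arising in the talk. It is compared against the standard bound sqrt(2 log |T|) for a finite set of unit-norm points. All names are hypothetical.

```python
import math
import random

def gaussian_width(points, trials=1000, seed=0):
    """Monte Carlo estimate of E[max_t <g, t>] over a finite set of points."""
    rng = random.Random(seed)
    dim = len(points[0])
    total = 0.0
    for _ in range(trials):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += max(sum(gi * ti for gi, ti in zip(g, t)) for t in points)
    return total / trials

dim = 20
# Toy set: the 2 * dim signed coordinate vectors +e_i and -e_i.
points = []
for i in range(dim):
    for sign in (1.0, -1.0):
        e = [0.0] * dim
        e[i] = sign
        points.append(e)

width = gaussian_width(points)
bound = math.sqrt(2 * math.log(len(points)))   # bound for finitely many unit vectors
```

The estimate stays below the logarithmic bound, which is the kind of control a Gaussian-width argument provides for the supremum term.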

15 Summary
Motivation: The problem arises in the context of sampled networks. Under many sampling designs, the expectation of the sampled degree distribution is the product of a design-dependent matrix and the true underlying degree distribution.
Main Idea: An unusual ill-conditioned linear inverse problem. The empirical analysis by Zhang et al. (2015) of the constrained penalized weighted least squares solution is the first non-parametric approach to the problem since it was proposed 35 years ago. To our knowledge, our work is the first attempt to produce theoretical guarantees on the performance of the proposed solution.
Thank you!
