Parameter estimators of sparse random intersection graphs with thinned communities

Lasse Leskelä, Aalto University
Johan van Leeuwaarden, Eindhoven University of Technology
Joona Karjalainen, Aalto University
WAW 2018, Moscow, 17-19 May 2018

Aalto University, Finland
Established in 2010 as a merger of Helsinki University of Technology, Helsinki School of Economics, and University of Art and Design Helsinki
20 000 students, 400 professors

Introduction: Triangle densities in graphs

Triangle densities
Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
Razborov lower bound
[Figure: feasible region of link and triangle densities, annotated "Sparse graphs live here"]
Lovász 2012

Triangles in social networks
Erdős-Rényi graph: #triangles = (#links)^3
Ugander, Backstrom, Kleinberg 2013

GOAL
Develop a statistical graph model with
  Nontrivial transitivity (clustering)
  Heavy-tailed degree distributions
  A small number of parameters that can be consistently estimated in reasonable computational time from an observed graph sample

Network models
  Erdős-Rényi graphs
  Stochastic block models and graphons
  Uniform random graphs with given degree distribution
  Exponential random graphs
  Geometric random graphs
  Preferential attachment models

Random intersection graphs

Intersection graph
[Figure: bipartite diagram linking nodes 1-4 to communities (attributes) 1-3]
Two nodes are connected when they share at least one community
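The rule "two nodes are connected when they share at least one community" can be sketched in a few lines of Python. The function name `intersection_graph` and the example memberships are hypothetical; they are not the memberships drawn on the slide.

```python
from itertools import combinations

def intersection_graph(communities):
    """Return the set of links (i, j), i < j, such that i and j share a community."""
    links = set()
    for members in communities:
        # Every pair of nodes inside one community becomes a link.
        for i, j in combinations(sorted(members), 2):
            links.add((i, j))
    return links

# Hypothetical memberships: 4 nodes, 3 communities.
communities = [{1, 2}, {2, 3, 4}, {4}]
print(sorted(intersection_graph(communities)))
```

A community with a single member contributes no links, and a pair covered by several communities is still counted as one link.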

Random intersection graph
[Figure: n nodes and m communities V1, V2, V3, ...]
Communities V1, ..., Vm are independent random sets with random size
Random graph with n-by-n adjacency matrix: nodes i and j are adjacent when some community Vk contains both
Statistical model parametrized by (n, m, π), where π is the community size distribution
[Godehardt, Jaworski 2001] [Bloznelis 2013] [Karjalainen, Leskelä 2017]

Random intersection graph
The model can also be viewed as a random hypergraph with m hyperedges of random size
Each community induces a clique -> Cliques overrepresented?

Thin random intersection graphs

Thin random intersection graph
[Figure: n nodes and m communities V1, V2, V3, ...]
Cij,k are independent {0,1}-valued random variables with mean q
Random graph with n-by-n adjacency matrix: nodes i and j are adjacent when some community Vk contains both and Cij,k = 1
Statistical model parametrized by (n, m, π, q), where q is the community strength
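A minimal sampler for the model parametrized by (n, m, π, q) might look as follows. It assumes that, given its size, each community is a uniformly random node set, and that a shared community k links nodes i and j only when the thinning variable Cij,k equals 1. The function name `sample_thin_rig` and the `size_sampler` callback (one draw from π) are hypothetical names introduced for illustration.

```python
import random
from itertools import combinations

def sample_thin_rig(n, m, size_sampler, q, rng):
    """Sample the link set of a thin random intersection graph on nodes 0..n-1.

    size_sampler(rng) draws one community size from pi (assumed interface).
    """
    links = set()
    for _ in range(m):
        size = min(size_sampler(rng), n)
        members = rng.sample(range(n), size)  # uniform random node set of that size
        for i, j in combinations(sorted(members), 2):
            if rng.random() < q:  # thinning variable C_ij,k with mean q
                links.add((i, j))
    return links

# Usage: 100 nodes, 50 communities of size 1 to 4, community strength 0.5.
rng = random.Random(1)
links = sample_thin_rig(100, 50, lambda r: r.randint(1, 4), 0.5, rng)
print(len(links))
```

With q = 1 every community becomes a full clique and the sketch reduces to a standard random intersection graph.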

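Sparse parameter regime
The mean number of communities covering a set of r nodes is characterized by the factorial moments (π)_r of the community size distribution
The model is sparse (Pr(link) << 1) if and only if the mean number of thinned communities linking a given node pair is << 1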
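The formulas of this slide are not preserved in the text. Under the assumption that each community is, given its size, a uniformly random node set, the quantities can be sketched as follows. The falling-factorial notation (x)_r is an assumption chosen to match the symbol (π)_r used elsewhere in the talk, and the expressions are a reconstruction, not necessarily the exact form shown on the slide.

```latex
\begin{align*}
(\pi)_r &= \sum_{x \ge 0} (x)_r \, \pi(x),
  \qquad (x)_r = x(x-1)\cdots(x-r+1), \\
\mathbb{E}\,\#\{k : S \subset V_k\} &= m \, \frac{(\pi)_r}{(n)_r}
  \qquad \text{for a node set } S \text{ with } |S| = r, \\
\Pr(\text{link}) &= 1 - \Bigl(1 - q \, \frac{(\pi)_2}{(n)_2}\Bigr)^{m} \ll 1
  \iff m q \, \frac{(\pi)_2}{(n)_2} \ll 1 .
\end{align*}
```

The last line uses that a given node pair is linked via one particular community with probability q(π)_2/(n)_2, independently across the m communities.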

Subgraph densities

Subgraph densities In the sparse parameter regime with [Karjalainen Leskelä 2017] [Dewar Healy Perez-Gimenez Pralat Proos Reiniger Ternovsky 2017]

Extremal triangle densities
Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
Razborov lower bound
[Figure: feasible region of link and triangle densities, annotated "Sparse graphs live here"]
Lovász 2012

Triangles in social networks
Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
Erdős-Rényi graph: #triangles = (#links)^3
Ugander, Backstrom, Kleinberg 2013

Triangles in thin random intersection graphs
Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
Thin RIGs in the sparse regime: #triangles ~ c (#links)^2
Erdős-Rényi graph: #triangles = (#links)^3

Degree distribution

Degree distribution
Balanced sparse regime:
  Number of communities: m/n ~ const
  Community strength: q ~ const
  Community size distribution: (π)_r ~ const for r = 1, 2, 3
The limiting degree distribution is a compound Poisson distribution, expressed through a Poisson rate and k-fold convolutions of the downshifted size-biased community size distribution
When the limiting community size distribution is heavy-tailed, so is the limiting degree distribution
Joint work with Mindaugas Bloznelis (upcoming)

Transitivity

Transitivity
Probability that two random neighbors of a random node in G are connected

Realized vs. model transitivity
Both are defined for a random node triplet (I1, I2, I3):
  Transitivity of a graph realization (a random variable): a conditional probability given the graph realization
  Model transitivity (deterministic number)
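For one observed graph, the realized transitivity can be computed as the fraction of 2-stars that are closed into triangles, that is 3 x #triangles / #2-stars. The sketch below assumes the graph is given as a set of undirected links without self-loops; the function name `realized_transitivity` is hypothetical.

```python
from itertools import combinations

def realized_transitivity(links):
    """Return 3 * (#triangles) / (#2-stars) for an undirected simple graph."""
    nbrs = {}
    for i, j in links:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    # A 2-star is a center node together with an unordered pair of its neighbors.
    two_stars = sum(len(s) * (len(s) - 1) // 2 for s in nbrs.values())
    # A closed 2-star has adjacent end nodes; each triangle is counted three times.
    closed = sum(1 for s in nbrs.values()
                 for a, b in combinations(s, 2) if b in nbrs[a])
    return closed / two_stars if two_stars else 0.0

# Triangle 1-2-3 with a pendant node 4 attached to node 3.
print(realized_transitivity({(1, 2), (2, 3), (1, 3), (3, 4)}))
```

In the example there are five 2-stars, three of which are closed, so the value is 3/5.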

Model transitivity
In the balanced sparse regime the model transitivity has an explicit nonzero limit
The model displays nontrivial transitivity in the balanced sparse parameter regime.
But is the same true for the transitivity of graph realizations?
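The limiting expression itself is not preserved in the text. A heuristic leading-order calculation, assuming uniformly placed communities and writing (π)_r for the factorial moments of the community size distribution, suggests the following form. It is a sketch under these assumptions and may differ in presentation from the formula on the slide.

```latex
\begin{align*}
\Pr(\text{triangle on } i,j,k) &\approx m \, q^3 \, \frac{(\pi)_3}{(n)_3}, \\
\Pr(\text{2-star centered at } i) &\approx m \, q^2 \, \frac{(\pi)_3}{(n)_3}
  + \Bigl( m \, q \, \frac{(\pi)_2}{(n)_2} \Bigr)^{2}, \\
\tau &\approx \frac{q \, (\pi)_3}{(\pi)_3 + (m/n)\,(\pi)_2^{2}} .
\end{align*}
```

The first 2-star term comes from one community covering all three nodes, the second from two different communities each covering one of the two links. With m/n, q, and (π)_r of constant order, the ratio stays bounded away from zero.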

Realized transitivity
In the balanced sparse regime the realized transitivity matches the model transitivity whp when the community size distribution has a bounded 6th factorial moment
YES, the transitivity of the realized graph (and every sufficiently large subgraph) agrees with the model transitivity, with high probability

Statistical estimation

Fitting model to data
Can we learn the model parameters from one observed graph sample?
Balanced sparse regime:
  Number of communities: m/n ~ const
  Community strength: q ~ const
  Community size distribution: (π)_r ~ const for r = 1, 2, 3
Note: Maximum likelihood estimation is computationally intractable due to the nonlinear intersection map.

Induced subgraph sampling
[Figure: node set V with the sampled node set V^(n0)]
We observe a subgraph G^(n0) induced by a node set V^(n0).
Can we infer the parameters of the statistical graph model from the observed subgraph?
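Induced subgraph sampling keeps exactly those links whose both endpoints fall in the sampled node set. The sketch below assumes the node set is drawn uniformly at random, which is one possible sampling design and not stated on the slide; the function name `induced_subgraph` is hypothetical.

```python
import random

def induced_subgraph(links, nodes, n0, rng):
    """Sample n0 nodes uniformly and return (sampled nodes, induced links)."""
    sampled = set(rng.sample(sorted(nodes), n0))
    # Keep a link only if both of its endpoints were sampled.
    induced = {(i, j) for i, j in links if i in sampled and j in sampled}
    return sampled, induced

# Usage: observe 3 of the 5 nodes of a path graph.
path = {(0, 1), (1, 2), (2, 3), (3, 4)}
nodes_seen, links_seen = induced_subgraph(path, range(5), 3, random.Random(0))
print(sorted(nodes_seen), sorted(links_seen))
```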

MAIN RESULTS

Empirical subgraph counts
Balanced sparse regime:
  Number of communities: m/n ~ const
  Community strength: q ~ const
  Community size distribution: (π)_r ~ const for r = 1, 2, 3

Fitting the Bernoulli model
Community size distribution: π = Bin(n, p) with rate parameter p
Number of communities: m
Model parametrized by (λ, μ, q):
  λ is the mean degree
  μ is the mean number of communities covering a node
  q is the community strength
The parameters μ and q can be solved from the mean degree λ, degree variance σ², and transitivity τ.
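The relations used to solve for μ and q are not preserved in the text. A heuristic derivation in the balanced sparse regime, assuming each node joins each community independently with probability p (so that μ = mp and the degree is approximately a Poisson(μ) sum of Poisson(npq) terms), gives the following sketch. These are reconstructed relations under the stated assumptions, not a quotation of the slide.

```latex
\begin{align*}
\lambda &= \mu \, q \, n p, \\
\sigma^2 &= \lambda \Bigl( 1 + \frac{\lambda}{\mu} \Bigr)
  \quad\Longrightarrow\quad \mu = \frac{\lambda^2}{\sigma^2 - \lambda}, \\
\tau &= \frac{q}{1 + \mu}
  \quad\Longrightarrow\quad q = \tau \, (1 + \mu) .
\end{align*}
```

As a check against the simulated setting with λ = 9, μ = 3, q = 1: the relations give σ² = 36 and τ = 1/4, and solving back returns μ = 81/27 = 3 and q = 1.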

Consistent moment estimators
Model parametrized by (λ, μ, q):
  λ is the mean degree
  μ is the mean number of communities covering a node
  q is the community strength
The model parameters can be consistently estimated from an induced subgraph of size n0 >> n^(2/3) in O(Δ n0) time
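A plug-in version of such moment estimators can be sketched as follows. It assumes the fully observed case n0 = n and the heuristic Bernoulli-model relations σ² = λ(1 + λ/μ) and τ = q/(1 + μ); both are assumptions of this sketch, and the estimators in the talk additionally handle n0 < n. The function names `solve_mu_q` and `moment_estimates` are hypothetical.

```python
from itertools import combinations

def solve_mu_q(lam, var, tau):
    """Solve var = lam * (1 + lam / mu) and tau = q / (1 + mu) for (mu, q)."""
    if var <= lam:
        raise ValueError("degree variance must exceed the mean degree")
    mu = lam ** 2 / (var - lam)
    return mu, tau * (1 + mu)

def moment_estimates(links, n0):
    """Plug-in estimates (lam, mu, q) from a fully observed graph on n0 nodes."""
    nbrs = {}
    for i, j in links:
        nbrs.setdefault(i, set()).add(j)
        nbrs.setdefault(j, set()).add(i)
    degrees = [len(s) for s in nbrs.values()]
    lam = sum(degrees) / n0  # isolated nodes contribute degree 0
    var = sum(d * d for d in degrees) / n0 - lam ** 2
    two_stars = sum(d * (d - 1) // 2 for d in degrees)
    # Closed 2-stars; each triangle is counted once per corner.
    closed = sum(1 for s in nbrs.values()
                 for a, b in combinations(s, 2) if b in nbrs[a])
    tau = closed / two_stars if two_stars else 0.0
    mu, q = solve_mu_q(lam, var, tau)
    return lam, mu, q

# Population-level check with lambda = 9, sigma^2 = 36, tau = 1/4.
print(solve_mu_q(9, 36, 0.25))
```

The estimate of μ is undefined when the empirical degree variance does not exceed the mean degree, which is why the sketch raises an error in that case instead of returning a value.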

Numerical experiments

Simulated Bernoulli model (q = 1), λ = 9, μ = 3, n0 = n
[Figure: realized and expected values of the estimators λ̂ and μ̂]

Fluctuations of the estimators
[Figure: empirical distribution of μ̂ computed from 1000 simulations, λ = 9, μ = 3, n0 = n = 750]
The fluctuations appear asymptotically Gaussian

Data experiments

Fitting a Bernoulli model
Relatively strong communities identified for citation networks
Moderately strong communities identified for email, Facebook and Flickr data
Two moderately strong communities identified for US airport network
Future work: Can you identify the (overlapping) communities?

Fitting a standard random intersection graph (q = 1)
Fitted values for a model with q = 1

Proofs

Proof ingredients
Combinatorial analysis of graph patterns resulting from unions of links, 2-stars, and triangles
Computing asymptotically the most likely bipartite pattern to induce a given graph motif
Second moment method

Graph obtained as unions of overlapping triangles

Graph obtained as unions of overlapping 2-stars

Approximate densities of some graph patterns

Minimal covering families of some graph patterns

Summary

GOAL REVISITED
Develop a statistical graph model with
  Nontrivial transitivity (clustering)
  Heavy-tailed degree distributions
  A small number of parameters that can be consistently estimated in reasonable computational time from an observed graph sample

SUMMARY
Thin random intersection graphs are statistical graph models with overlapping communities:
  Nontrivial transitivity (clustering)
  Possibly heavy-tailed degree distributions
  Model parameters (one-parameter community size distribution) can be consistently estimated in O(Δ n0) computational time from an observed subgraph of size n0 >> n^(2/3)

ONGOING/FUTURE WORK
  Extend to parametric families of community size distributions (e.g. power laws)
  Develop estimators for biased subgraph samples (network crawling, snowball sampling)
  Prove asymptotic normality
  Develop goodness-of-fit tests
  Learn the (overlapping) community structure