Parameter estimators of sparse random intersection graphs with thinned communities
Lasse Leskelä (Aalto University), Johan van Leeuwaarden (Eindhoven University of Technology), Joona Karjalainen (Aalto University)
WAW 2018, Moscow, 17-19 May 2018
Aalto University, Finland: established in 2010 as a merger of Helsinki University of Technology, Helsinki School of Economics, and University of Art and Design Helsinki. 20 000 students, 400 professors.
Introduction: Triangle densities in graphs
Triangle densities: Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5. Razborov lower bound. Sparse graphs live near the origin of the diagram, where the link density is close to zero. Source: Lovász 2012.
Triangles in social networks: Erdős-Rényi graph: #triangles = (#links)^3. Source: Ugander, Backstrom, Kleinberg 2013.
GOAL: Develop a statistical graph model with (1) nontrivial transitivity (clustering), (2) heavy-tailed degree distributions, and (3) a small number of parameters that can be consistently estimated in reasonable computational time from an observed graph sample.
Network models: Erdős-Rényi graphs; stochastic block models and graphons; uniform random graphs with given degree distribution; exponential random graphs; geometric random graphs; preferential attachment models.
Random intersection graphs
Intersection graph: nodes 1, 2, 3, 4 and communities 1, 2, 3. Two nodes are connected when they share at least one community.
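The connection rule can be illustrated with a minimal Python sketch. The function name `intersection_graph` and the example memberships are hypothetical: the slide's figure is not reproduced in this text, so the memberships below are only an illustration for four nodes and three communities.

```python
from itertools import combinations

def intersection_graph(communities):
    """Return the edge set of the intersection graph.

    `communities` maps a community label to the set of nodes it contains.
    Two nodes are linked when at least one community contains both.
    """
    edges = set()
    for members in communities.values():
        # Each community induces a clique on its members.
        for u, v in combinations(sorted(members), 2):
            edges.add((u, v))
    return edges

# Hypothetical memberships (not taken from the slide's figure).
example = {1: {1, 2}, 2: {2, 3, 4}, 3: {1, 3}}
print(sorted(intersection_graph(example)))
```

Nodes 1 and 4 stay unlinked in this example because no community contains both.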
Random intersection graph: n nodes, m communities. Communities V1, ..., Vm are independent random sets with random size. Random graph with n-by-n adjacency matrix A, where Aij = 1 when nodes i and j share at least one community and Aij = 0 otherwise. Statistical model parametrized by (n,m,π), where π is the community size distribution [Godehardt, Jaworski 2001] [Bloznelis 2013] [Karjalainen, Leskelä 2017].
Random intersection graph: The model can also be viewed as a random hypergraph with m hyperedges of random size. Each community induces a clique, so are cliques overrepresented?
Thin random intersection graphs
Thin random intersection graph: n nodes, m communities V1, ..., Vm. Cij,k are independent {0,1}-valued random variables with mean q. Random graph with n-by-n adjacency matrix A, where Aij = 1 when Cij,k = 1 for at least one community Vk containing both i and j. Statistical model parametrized by (n,m,π,q), where q is the community strength.
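A minimal sampling sketch of the thinned model follows. It assumes that, given its size, each community is a uniformly random node subset (the usual convention for random intersection graphs, not stated on the slide). The names `sample_thin_rig` and `size_sampler` are hypothetical, and the size distribution in the usage lines is an arbitrary choice.

```python
import random
from itertools import combinations

def sample_thin_rig(n, m, size_sampler, q, rng):
    """Sample a thin random intersection graph as a set of edges (u, v), u < v.

    size_sampler(rng) returns one community size (a draw from pi).
    Assumption: given its size, a community is a uniformly random node subset.
    Each node pair inside a community is linked through that community
    independently with probability q.
    """
    edges = set()
    for _ in range(m):
        size = min(size_sampler(rng), n)
        members = rng.sample(range(n), size)
        for u, v in combinations(sorted(members), 2):
            if rng.random() < q:  # C_{uv,k} = 1 with probability q
                edges.add((u, v))
    return edges

rng = random.Random(1)
# Hypothetical size distribution: uniform on {2, 3, 4}.
g = sample_thin_rig(n=200, m=100, size_sampler=lambda r: r.randint(2, 4), q=0.5, rng=rng)
print(len(g), "links")
```

Setting q = 1 recovers the standard random intersection graph, in which every community is a full clique.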
Sparse parameter regime: The mean number of communities covering a set of r nodes is characterized by the factorial moments (π)r of the community size distribution. The model is sparse when Pr(link) << 1.
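The formulas on this slide did not survive extraction. The following LaTeX sketch is a reconstruction, not the slide's own notation. It assumes that a community of size x is a uniformly random x-subset of the n nodes, so that it covers a fixed set of r nodes with probability (x)_r / (n)_r.

```latex
% Notation (assumed): (x)_r = x(x-1)\cdots(x-r+1), \quad (\pi)_r = \sum_x (x)_r \, \pi(x)
\begin{align*}
\mathbb{E}\,\bigl|\{k : S \subset V_k\}\bigr|
  &= m \, \frac{(\pi)_r}{(n)_r}
  \qquad \text{for a node set } S \text{ with } |S| = r, \\
\Pr(\text{link})
  &= 1 - \Bigl(1 - q \, \frac{(\pi)_2}{(n)_2}\Bigr)^{m},
  \qquad\text{so}\qquad
  \Pr(\text{link}) \ll 1
  \iff
  \frac{m \, q \, (\pi)_2}{(n)_2} \ll 1 .
\end{align*}
```

The second line uses the independence of the communities and of the variables Cij,k: a node pair is unlinked exactly when none of the m communities both covers the pair and retains the link.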
Subgraph densities
Subgraph densities in the sparse parameter regime: [Karjalainen, Leskelä 2017] [Dewar, Healy, Perez-Gimenez, Pralat, Proos, Reiniger, Ternovsky 2017]
Extremal triangle densities: Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5. Razborov lower bound. Sparse graphs live near the origin of the diagram, where the link density is close to zero. Source: Lovász 2012.
Triangles in social networks: Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5. Erdős-Rényi graph: #triangles = (#links)^3. Source: Ugander, Backstrom, Kleinberg 2013.
Triangles in thin random intersection graphs: Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5. Thin RIGs in the sparse regime: #triangles ~ c (#links)^2. Erdős-Rényi graph: #triangles = (#links)^3.
Degree distribution
Degree distribution. Balanced sparse regime: number of communities m/n ~ const; community strength q ~ const; community size distribution (π)r ~ const for r = 1, 2, 3. The limiting degree distribution is a compound Poisson distribution, expressed through a Poisson rate, k-fold convolutions, and the downshifted size-biased community size distribution. When the limiting community size distribution is heavy-tailed, so is the limiting degree distribution. Joint work with Mindaugas Bloznelis (upcoming).
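The compound Poisson formula itself is missing from this text. The sketch below is a heuristic reconstruction and may differ from the slide's expression in notation or in how the thinning enters. It assumes that the number of communities covering a node is asymptotically Poisson, that a community found through one of its members has a size-biased size, and that overlaps between communities are negligible in the sparse regime. The symbols f, g and Y are introduced here for illustration only.

```latex
\begin{align*}
f(x) &= \sum_{k \ge 0} e^{-\mu} \, \frac{\mu^{k}}{k!} \; g^{*k}(x),
  \qquad \mu = \lim \frac{m \, (\pi)_1}{n}
  \quad \text{(Poisson rate: mean number of communities covering a node)}, \\
g &= \text{law of } \mathrm{Bin}(Y - 1,\, q),
  \qquad \Pr(Y = y) = \frac{y \, \pi(y)}{(\pi)_1}
  \quad \text{(size-biased community size, downshifted by one)} .
\end{align*}
```

Here g^{*k} denotes the k-fold convolution of g: a node covered by k communities collects an independent number of neighbors from each of them.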
Transitivity
Transitivity: the probability that two random neighbors of a random node in G are connected.
Realized vs. model transitivity: Both are defined for a random node triplet (I1,I2,I3) as the probability that the triplet forms a triangle, given that it forms a 2-star. The transitivity of a graph realization (a random variable) is this conditional probability given the graph realization. The model transitivity (a deterministic number) is the same probability computed under the model, without conditioning on the realization.
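The realized transitivity of an observed graph can be computed as sketched below: it equals three times the number of triangles divided by the number of 2-stars. The function name `transitivity` and the adjacency-dictionary representation are assumptions made for this example.

```python
def transitivity(adj):
    """Realized transitivity: 3 * (#triangles) / (#2-stars).

    adj maps each node to the set of its neighbors (undirected, no loops).
    Returns 0.0 when the graph contains no 2-stars.
    """
    two_stars = 0
    closed = 0  # 2-stars whose two end nodes are linked
    for center, nbrs in adj.items():
        nb = list(nbrs)
        for i, u in enumerate(nb):
            for v in nb[i + 1:]:
                two_stars += 1
                if v in adj[u]:
                    closed += 1  # each triangle is counted once per center
    return closed / two_stars if two_stars else 0.0

# Triangle 0-1-2 with a pendant node 3 attached to node 0.
g = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print(transitivity(g))
```

In the example there are five 2-stars, three of which are closed by the single triangle.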
Model transitivity: In the balanced sparse regime the model transitivity has an explicit nonzero limit, so the model displays nontrivial transitivity in this regime. But is the same true for the transitivity of graph realizations?
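The limiting expression is missing from this text. The LaTeX sketch below is a heuristic reconstruction, not the slide's formula. It keeps only the dominant ways of forming each pattern in the balanced sparse regime: a triangle arises from a single community covering all three nodes, while a 2-star arises either from a single community or from two distinct communities that each cover one of its links.

```latex
\begin{align*}
\tau
  = \frac{\Pr(\text{triangle})}{\Pr(\text{2-star})}
  &\approx
  \frac{ m \, q^{3} \, \dfrac{(\pi)_3}{(n)_3} }
       { m \, q^{2} \, \dfrac{(\pi)_3}{(n)_3}
         + \Bigl( m \, q \, \dfrac{(\pi)_2}{(n)_2} \Bigr)^{2} } \\
  &\approx
  \frac{ q \, (\pi)_3 }{ (\pi)_3 + \dfrac{m}{n} \, (\pi)_2^{2} } ,
  \qquad \text{using } \frac{(n)_3}{(n)_2^{2}} \sim \frac{1}{n} .
\end{align*}
```

With m/n, q and (π)r of constant order, both terms in the denominator are of the same order, which is why the ratio stays bounded away from 0 and 1.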
Realized transitivity: In the balanced sparse regime, when the community size distribution has bounded 6th factorial moments, the transitivity of the realized graph (and of every sufficiently large subgraph) agrees with the model transitivity with high probability (whp). So the answer is yes.
Statistical estimation
Fitting model to data: Can we learn the model parameters from one observed graph sample? Balanced sparse regime: number of communities m/n ~ const; community strength q ~ const; community size distribution (π)r ~ const for r = 1, 2, 3. Note: Maximum likelihood estimation is computationally intractable due to the nonlinear intersection map.
Induced subgraph sampling: We observe a subgraph G^(n0) induced by a node set V^(n0), a subset of the full node set V. Can we infer the parameters of the statistical graph model from the observed subgraph?
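Induced subgraph sampling can be sketched as follows. The slide does not say how the node set is chosen. This example assumes a uniformly random node sample of size n0, and the name `induced_subgraph` is hypothetical.

```python
import random

def induced_subgraph(adj, n0, rng):
    """Return the subgraph induced by n0 nodes sampled uniformly at random.

    adj maps each node to the set of its neighbors (undirected, no loops).
    Assumption: uniform sampling without replacement.
    """
    keep = set(rng.sample(sorted(adj), n0))
    # A link survives only when both of its end nodes are sampled.
    return {u: adj[u] & keep for u in keep}

# Complete graph on 6 nodes; any 3 sampled nodes induce a triangle.
k6 = {u: {v for v in range(6) if v != u} for u in range(6)}
sub = induced_subgraph(k6, 3, random.Random(0))
print(sub)
```

Links between a sampled node and an unsampled node are discarded, so the observed degrees are smaller than the degrees in the full graph.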
MAIN RESULTS
Empirical subgraph counts. Balanced sparse regime: number of communities m/n ~ const; community strength q ~ const; community size distribution (π)r ~ const for r = 1, 2, 3.
Fitting the Bernoulli model: Community size distribution π = Bin(n,p) with rate parameter p; number of communities m. Model parametrized by (λ,μ,q): λ is the mean degree, μ is the mean number of communities covering a node, q is the community strength. The parameters μ and q can be solved from the mean degree λ, degree variance σ^2, and transitivity τ.
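The equations linking (λ, σ^2, τ) to (μ, q) are missing from this text. The LaTeX sketch below is a heuristic reconstruction and may differ in form from the slide. It treats the number of communities covering a node as Poisson with mean μ = mp, and the number of neighbors gained through each such community as Poisson with mean νq. The symbol ν = np (mean community size) is introduced here for illustration only.

```latex
\begin{align*}
\lambda &= \mu \, \nu \, q ,
&
\sigma^{2} &= \mu \bigl( \nu q + (\nu q)^{2} \bigr)
            = \lambda \Bigl( 1 + \frac{\lambda}{\mu} \Bigr) ,
&
\tau &= \frac{q}{1 + \mu} , \\[4pt]
\mu &= \frac{\lambda^{2}}{\sigma^{2} - \lambda} ,
&
q &= \tau \, (1 + \mu) . &&
\end{align*}
```

The first row states the moment relations and the second row inverts them. The inversion requires σ^2 > λ, that is, degrees more dispersed than a Poisson distribution.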
Consistent moment estimators: Model parametrized by (λ,μ,q): λ is the mean degree, μ is the mean number of communities covering a node, q is the community strength. The model parameters can be consistently estimated from an induced subgraph of size n0 >> n^(2/3) in O(Δ n0) time.
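A plug-in version of such moment estimators can be sketched in Python. This is not the authors' exact estimator. It assumes the whole graph is observed (n0 = n), and it uses the heuristic Bernoulli-model relations σ^2 = λ(1 + λ/μ) and τ = q/(1 + μ), which are a reconstruction rather than formulas quoted from the slides. The name `moment_estimates` is hypothetical.

```python
def moment_estimates(adj):
    """Plug-in estimates (lambda, mu, q) from a fully observed graph.

    adj maps each node to the set of its neighbors (undirected, no loops).
    Assumed relations: sigma^2 = lambda * (1 + lambda / mu), tau = q / (1 + mu).
    """
    degs = [len(nbrs) for nbrs in adj.values()]
    n = len(degs)
    lam = sum(degs) / n                            # mean degree
    var = sum((d - lam) ** 2 for d in degs) / n    # degree variance
    two_stars = sum(d * (d - 1) // 2 for d in degs)
    closed = 0                                     # equals 3 * (#triangles)
    for nbrs in adj.values():
        nb = list(nbrs)
        for i, u in enumerate(nb):
            for v in nb[i + 1:]:
                if v in adj[u]:
                    closed += 1
    if var <= lam or two_stars == 0:
        raise ValueError("moment equations have no valid solution for this graph")
    tau = closed / two_stars                       # realized transitivity
    mu = lam ** 2 / (var - lam)
    q = tau * (1 + mu)
    return lam, mu, q

# Star graph with five leaves: overdispersed degrees, no triangles.
star = {0: {1, 2, 3, 4, 5}, 1: {0}, 2: {0}, 3: {0}, 4: {0}, 5: {0}}
print(moment_estimates(star))
```

The double loop over neighbor pairs costs at most the sum of squared degrees, which is bounded by the maximum degree times twice the number of links. For a partially observed graph with n0 < n the counts would need rescaling, which this sketch does not attempt.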
Numerical experiments
Simulated Bernoulli model (q=1), λ = 9, μ = 3, n0 = n. Plots: realized and expected values of the estimators λ^ and μ^.
Fluctuations of the estimators: Empirical distribution of μ^ computed from 1000 simulations, λ = 9, μ = 3, n0 = n = 750. The fluctuations appear asymptotically Gaussian.
Data experiments
Fitting a Bernoulli model: Relatively strong communities identified for citation networks. Moderately strong communities identified for email, Facebook and Flickr data. Two moderately strong communities identified for US airport network. Future work: Can you identify the (overlapping) communities?
Fitting a standard random intersection graph (q=1): fitted values for a model with q=1.
Proofs
Proof ingredients: (1) combinatorial analysis of graph patterns resulting from unions of links, 2-stars, and triangles; (2) computing asymptotically the most likely bipartite pattern to induce a given graph motif; (3) second moment method.
Graph obtained as unions of overlapping triangles
Graph obtained as unions of overlapping 2-stars
Approximate densities of some graph patterns
Minimal covering families of some graph patterns
Summary
GOAL REVISITED: Develop a statistical graph model with (1) nontrivial transitivity (clustering), (2) heavy-tailed degree distributions, and (3) a small number of parameters that can be consistently estimated in reasonable computational time from an observed graph sample.
SUMMARY: Thin random intersection graphs are statistical graph models with overlapping communities. They have (1) nontrivial transitivity (clustering), (2) possibly heavy-tailed degree distributions, and (3) model parameters (one-parameter community size distribution) that can be consistently estimated in O(Δ n0) computational time from an observed subgraph of size n0 >> n^(2/3).
ONGOING/FUTURE WORK: Extend to parametric families of community size distributions (e.g. power laws). Develop estimators for biased subgraph samples (network crawling, snowball sampling). Prove asymptotic normality. Develop goodness-of-fit tests. Learn the (overlapping) community structure.