Parameter estimators of sparse random intersection graphs with thinned communities


Lasse Leskelä (Aalto University), Johan van Leeuwaarden (Eindhoven University of Technology), Joona Karjalainen (Aalto University)
WAW 2018, Moscow, 17-19 May 2018

Aalto University, Finland
Established in 2010 as a merger of Helsinki University of Technology, Helsinki School of Economics, and University of Art and Design Helsinki
20 000 students, 400 professors

Introduction: Triangle densities in graphs

Triangle densities (figure from Lovász 2012)
- Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
- Razborov lower bound
- Sparse graphs live here (annotation on the figure)

Triangles in social networks (figure from Ugander, Backstrom, Kleinberg 2013)
- Erdős-Rényi graph: #triangles = (#links)^3

GOAL: Develop a statistical graph model with
- Nontrivial transitivity (clustering)
- Heavy-tailed degree distributions
- A small number of parameters that can be consistently estimated in reasonable computational time from an observed graph sample

Network models
- Erdős-Rényi graphs
- Stochastic block models and graphons
- Uniform random graphs with given degree distribution
- Exponential random graphs
- Geometric random graphs
- Preferential attachment models

Random intersection graphs

Intersection graph (figure: nodes 1-4 attached to communities, or attributes, 1-3)
Two nodes are connected when they share at least one community
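The rule on this slide can be sketched in a few lines of Python. The membership sets below are hypothetical and are not read off the slide's figure; the function name is also just illustrative.

```python
from itertools import combinations


def intersection_graph(communities):
    """Return the edge set of the intersection graph.

    communities: iterable of sets of node labels.
    Two nodes are linked when at least one community contains both.
    """
    edges = set()
    for members in communities:
        # Every pair inside one community becomes a link (a clique).
        for i, j in combinations(sorted(members), 2):
            edges.add((i, j))
    return edges


# Hypothetical memberships for four nodes and three communities.
communities = [{1, 2}, {2, 3, 4}, {4}]
edges = intersection_graph(communities)
```

Storing each link as an ordered pair (i, j) with i < j keeps the edge set free of duplicates when two nodes share several communities.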

Random intersection graph
- n nodes, m communities
- Communities V1, ..., Vm are independent random sets with random size
- Random graph with n-by-n adjacency matrix
- Statistical model parametrized by (n, m, π), where π is the community size distribution
[Godehardt, Jaworski 2001] [Bloznelis 2013] [Karjalainen, Leskelä 2017]

Random intersection graph
- The model can also be viewed as a random hypergraph with m hyperedges of random size
- Each community induces a clique -> Cliques overrepresented?

Thin random intersection graphs

Thin random intersection graph
- n nodes, m communities V1, V2, V3, ...
- C_ij,k are independent {0,1}-valued random variables with mean q
- Random graph with n-by-n adjacency matrix
- Statistical model parametrized by (n, m, π, q), where q is the community strength
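A minimal sampler for this model might look as follows. It assumes the Bernoulli special case that appears later in the talk (each node joins each community independently with probability p, so community sizes are Bin(n, p)), and it reads C_ij,k as the indicator that community k actually links its members i and j. The function name and signature are hypothetical.

```python
import random
from itertools import combinations


def sample_thin_rig(n, m, p, q, seed=None):
    """Sample the edge set of a thin random intersection graph.

    Assumption: Bernoulli community sizes, i.e. node i joins community k
    independently with probability p.
    """
    rng = random.Random(seed)
    edges = set()
    for _ in range(m):
        # Members of one community, in increasing order.
        community = [i for i in range(n) if rng.random() < p]
        for i, j in combinations(community, 2):
            # Thinning: the community links the pair with probability q.
            if rng.random() < q:
                edges.add((i, j))
    return edges
```

With q = 1 every community becomes a full clique and the sampler reduces to a standard random intersection graph. The double loop costs time proportional to the sum of squared community sizes, which is fine for sparse parameter choices but not tuned for large dense ones.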

Sparse parameter regime
- Mean number of communities covering a set of r nodes: [formula not preserved] is characterized by the factorial moments [formula not preserved]
- The model is sparse (Pr(link) << 1) if and only if [condition not preserved]
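The formulas on this slide did not survive transcription. One plausible reading, offered as a sketch and not as the slide's own notation, follows from assuming that a community of size x is uniformly distributed over the node sets of that size. Here S is a fixed set of r nodes, and (π)_r is taken to be the r-th factorial moment of π, which matches the (π)r notation used elsewhere in the talk.

```latex
% Assumption: given |V_k| = x, the community V_k is a uniformly random x-subset of the n nodes.
\[
  \Pr\bigl(V_k \supseteq S \,\big|\, |V_k| = x\bigr) = \frac{(x)_r}{(n)_r},
  \qquad (x)_r = x(x-1)\cdots(x-r+1),
\]
\[
  \mathbb{E}\,\#\{k : V_k \supseteq S\} = m\,\frac{(\pi)_r}{(n)_r},
  \qquad (\pi)_r = \sum_{x} (x)_r\,\pi(x),
\]
% A pair is linked via community k with probability q (\pi)_2 / (n)_2,
% independently over the m communities.
\[
  \Pr(\text{link}) = 1 - \Bigl(1 - q\,\frac{(\pi)_2}{(n)_2}\Bigr)^{m} \ll 1
  \quad\Longleftrightarrow\quad
  q\,m\,\frac{(\pi)_2}{(n)_2} \ll 1 .
\]
```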

Subgraph densities

Subgraph densities in the sparse parameter regime [formulas not preserved]
[Karjalainen, Leskelä 2017] [Dewar, Healy, Perez-Gimenez, Pralat, Proos, Reiniger, Ternovsky 2017]

Extremal triangle densities (figure from Lovász 2012)
- Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
- Razborov lower bound
- Sparse graphs live here (annotation on the figure)

Triangles in social networks (figure from Ugander, Backstrom, Kleinberg 2013)
- Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
- Erdős-Rényi graph: #triangles = (#links)^3

Triangles in thin random intersection graphs
- Kruskal-Katona upper bound: #triangles ≤ (#links)^1.5
- Thin RIGs in the sparse regime: #triangles ~ c (#links)^2
- Erdős-Rényi graph: #triangles = (#links)^3
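To see where an exponent of 2 can come from, here is a first-order heuristic for the Bernoulli special case (each node joins each community independently with probability p). It is a sketch under stated assumptions, not the slide's derivation: np and mp are held constant, μ = mp denotes the mean number of communities covering a node, and the constant q/μ obtained below is specific to this special case.

```latex
% Assumptions: Bernoulli memberships with probability p; np and \mu = mp constant.
% Leading terms only; patterns that need two or more communities are of lower order.
\[
  \Pr(\text{link}) \approx m\,p^{2}\,q,
  \qquad
  \Pr(\text{triangle}) \approx m\,p^{3}\,q^{3},
\]
\[
  \frac{\Pr(\text{triangle})}{\Pr(\text{link})^{2}}
  \approx \frac{m\,p^{3}\,q^{3}}{m^{2}\,p^{4}\,q^{2}}
  = \frac{q}{m\,p}
  = \frac{q}{\mu}.
\]
```

In density terms this gives a triangle density of order (link density)^2, whereas an Erdős-Rényi graph with the same link density has triangle density (link density)^3.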

Degree distribution

Balanced sparse regime:
- Number of communities: m/n ~ const
- Community strength: q ~ const
- Community size distribution: (π)_r ~ const for r = 1, 2, 3

Degree distribution: compound Poisson distribution [formula not preserved; its parts are labelled on the slide as k-fold convolution, Poisson rate, and downshifted size-biased community size distribution]
When the limiting community size distribution is heavy-tailed, so is the limiting degree distribution
Joint work with Mindaugas Bloznelis (upcoming)
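The phrase "downshifted size-biased community size distribution" can be made concrete with the usual definitions: the size-biased version of π puts mass proportional to x·π(x) on x, and downshifting subtracts one (the number of other members seen by a typical member of a community). The sketch below assumes this reading, since the slide's formula is not preserved; the function name and the example distribution are hypothetical.

```python
def downshifted_size_biased(pmf):
    """Return the pmf of X* - 1, where X* is the size-biased version of X.

    pmf: dict mapping community size x (non-negative int) to probability.
    Assumes the standard definition P(X* = x) = x * P(X = x) / E[X].
    """
    mean = sum(x * w for x, w in pmf.items())
    if mean <= 0:
        raise ValueError("community size distribution must have positive mean")
    # Sizes x = 0 carry no size-biased mass and are dropped.
    return {x - 1: x * w / mean for x, w in pmf.items() if x > 0}


# Hypothetical community size distribution: sizes 1 and 3, equally likely.
shifted = downshifted_size_biased({1: 0.5, 3: 0.5})
```

Size biasing multiplies each probability by x, so a power-law tail loses one unit of exponent; this is one way to see how heavy-tailed community sizes can feed into heavy-tailed degrees.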

Transitivity

Transitivity: probability that two random neighbors of a random node in G are connected

Realized vs. model transitivity
- Transitivity of a graph realization (a random variable): [formula not preserved]
- Model transitivity (deterministic number): [formula not preserved]
- For a random node triplet (I1, I2, I3): the realized transitivity is a conditional probability given the graph realization
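For a concrete graph, the realized transitivity is commonly computed as the fraction of 2-stars (paths of length two) whose end nodes are linked, which equals 3 × #triangles / #2-stars. The sketch below uses that standard definition; the adjacency-set representation and the function name are choices made here, not taken from the slides.

```python
def transitivity(adj):
    """Realized transitivity of an undirected simple graph.

    adj: dict mapping each node to the set of its neighbours.
    Returns (number of closed 2-stars) / (number of 2-stars), or 0.0 if
    the graph has no 2-stars.
    """
    two_stars = 0
    closed = 0
    for center, nbrs in adj.items():
        ordered = sorted(nbrs)
        two_stars += len(ordered) * (len(ordered) - 1) // 2
        # A 2-star centred here is closed when its two leaves are linked.
        for idx, a in enumerate(ordered):
            for b in ordered[idx + 1:]:
                if b in adj[a]:
                    closed += 1
    return closed / two_stars if two_stars else 0.0


# Triangle on {1, 2, 3} with a pendant node 4 attached to node 3.
example = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
t = transitivity(example)
```

Each triangle is counted three times in `closed`, once per corner, which is why no extra factor of 3 appears in the final ratio.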

Model transitivity
In the balanced sparse regime the model transitivity is [formula not preserved]
The model displays nontrivial transitivity in the balanced sparse parameter regime.
But is the same true for the transitivity of graph realizations?

Realized transitivity
In the balanced sparse regime the realized transitivity is [formula not preserved] whp when the community size distribution has bounded 6th factorial moments
YES, the transitivity of the realized graph (and every sufficiently large subgraph) agrees with the model transitivity, with high probability

Statistical estimation

Fitting model to data
Can we learn the model parameters from one observed graph sample?
Balanced sparse regime: m/n ~ const, q ~ const, (π)_r ~ const for r = 1, 2, 3
Note: Maximum likelihood estimation is computationally intractable due to the nonlinear intersection map.

Induced subgraph sampling
We observe a subgraph G^(n0) induced by a node set V^(n0).
Can we infer the parameters of the statistical graph model from the observed subgraph?
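Induced subgraph sampling can be sketched as follows. The slides do not say how the node set V^(n0) is chosen; the code assumes it is a uniformly random set of n0 nodes, and the function name is hypothetical.

```python
import random


def induced_subgraph(edges, nodes, n0, seed=None):
    """Sample n0 nodes and return them with the links they induce.

    Assumption: the observed node set is uniformly random among all
    node sets of size n0.
    """
    rng = random.Random(seed)
    sampled = set(rng.sample(sorted(nodes), n0))
    # Keep a link only when both of its end nodes were sampled.
    kept = {(i, j) for (i, j) in edges if i in sampled and j in sampled}
    return sampled, kept


# Hypothetical full graph on five nodes.
all_nodes = {0, 1, 2, 3, 4}
all_edges = {(0, 1), (1, 2), (2, 3), (3, 4), (0, 4)}
sub_nodes, sub_edges = induced_subgraph(all_edges, all_nodes, 3, seed=0)
```

Links with only one sampled end node are discarded, so degrees in the observed subgraph are smaller than in the full graph; any estimator built on the subgraph has to account for that.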

MAIN RESULTS

Empirical subgraph counts (balanced sparse regime: m/n ~ const, q ~ const, (π)_r ~ const for r = 1, 2, 3)

Fitting the Bernoulli model
- Community size distribution: π = Bin(n, p) with rate parameter [formula not preserved]
- Number of communities: [formula not preserved]
- Model parametrized by (λ, μ, q):
  - λ is the mean degree
  - μ is the mean number of communities covering a node
  - q is the community strength
- The parameters μ and q can be solved from the mean degree λ, degree variance σ^2, and transitivity τ via [formulas not preserved]
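The slide's equations are lost, so the sketch below rests on relations derived here from a heuristic compound Poisson calculation for the Bernoulli model, not copied from the talk: σ^2 = λ(1 + λ/μ) and τ = q/(1 + μ). Under those assumed relations, μ and q follow by simple inversion. The function name is hypothetical.

```python
def solve_mu_q(lam, var, tau):
    """Solve (mu, q) from mean degree, degree variance and transitivity.

    Assumed limiting relations (a heuristic reading, not the slide's text):
        var = lam * (1 + lam / mu)
        tau = q / (1 + mu)
    """
    if var <= lam:
        raise ValueError("requires overdispersed degrees (var > lam)")
    mu = lam ** 2 / (var - lam)
    q = tau * (1.0 + mu)
    return mu, q


# Values consistent with lambda = 9, mu = 3, q = 1 under the assumed relations.
mu_hat, q_hat = solve_mu_q(9.0, 36.0, 0.25)
```

Replacing λ, σ^2 and τ by their empirical counterparts gives plug-in moment estimators. When the observed degrees are not overdispersed the inversion has no valid solution, which the function reports as an error.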

Consistent moment estimators
- Model parametrized by (λ, μ, q): λ is the mean degree, μ is the mean number of communities covering a node, q is the community strength
- The model parameters can be consistently estimated from an induced subgraph of size n0 >> n^(2/3) in O(Δ n0) time

Numerical experiments

Simulated Bernoulli model (q=1), λ = 9, μ = 3, n0 = n
Figure: realized and expected values of the estimators λ̂ and μ̂

Fluctuations of the estimators
Figure: empirical distribution of μ̂ computed from 1000 simulations, λ = 9, μ = 3, n0 = n = 750
The fluctuations appear asymptotically Gaussian

Data experiments

Fitting a Bernoulli model
- Relatively strong communities identified for citation networks
- Moderately strong communities identified for email, Facebook and Flickr data
- Two moderately strong communities identified for US airport network
- Future work: Can you identify the (overlapping) communities?

Fitting a standard random intersection graph (q=1)
Fitted values for a model with q=1

Proofs

Proof ingredients
- Combinatorial analysis of graph patterns resulting from unions of links, 2-stars, and triangles
- Computing asymptotically the most likely bipartite pattern to induce a given graph motif
- Second moment method

Graph obtained as unions of overlapping triangles

Graph obtained as unions of overlapping 2-stars

Approximate densities of some graph patterns

Minimal covering families of some graph patterns

Summary

GOAL REVISITED: Develop a statistical graph model with
- Nontrivial transitivity (clustering)
- Heavy-tailed degree distributions
- A small number of parameters that can be consistently estimated in reasonable computational time from an observed graph sample

SUMMARY: Thin random intersection graphs are statistical graph models with overlapping communities:
- Nontrivial transitivity (clustering)
- Possibly heavy-tailed degree distributions
- Model parameters (one-parameter community size distribution) can be consistently estimated in O(Δ n0) computational time from an observed subgraph of size n0 >> n^(2/3)

ONGOING/FUTURE WORK
- Extend to parametric families of community size distributions (e.g. power laws)
- Develop estimators for biased subgraph samples (network crawling, snowball sampling)
- Prove asymptotic normality
- Develop goodness-of-fit tests
- Learn the (overlapping) community structure