Deciphering and modeling heterogeneity in interaction networks

Deciphering and modeling heterogeneity in interaction networks (using variational approximations). S. Robin, INRA / AgroParisTech. Mathematical Modeling of Complex Systems, December 2013, Ecole Centrale de Paris.

Outline. State-space models for networks (incl. SBM); variational inference (incl. SBM); (variational) Bayesian model averaging; towards W-graphs.

Heterogeneity in interaction networks

Understanding network structure. Networks describe interactions between entities. Observed networks display heterogeneous topologies that one would like to decipher and better understand. Examples: dolphin social network; hyperlink network. [Newman and Girvan (2004)]

Modelling network heterogeneity. Latent variable models allow one to capture the underlying structure of a network. General setting for binary graphs [Bollobás et al. (2007)]: a latent (unobserved) variable $Z_i$ is associated with each node, $\{Z_i\}$ iid $\sim \pi$; the edges $Y_{ij} = \mathbb{I}\{i \sim j\}$ are independent conditionally on the $Z_i$'s: $\{Y_{ij}\}$ independent $\mid \{Z_i\}$, with $\Pr\{Y_{ij} = 1\} = \gamma(Z_i, Z_j)$. We focus here on model-based approaches, in contrast with, e.g., graph clustering [Girvan and Newman (2002)], [Newman (2004)] and spectral clustering [von Luxburg et al. (2008)].

Latent space models. State-space model: principle. Consider $n$ nodes ($i = 1..n$); $Z_i$ = unobserved position of node $i$, e.g. $\{Z_i\}$ iid $\sim \mathcal{N}(0, I)$; the edges $\{Y_{ij}\}$ are independent given $\{Z_i\}$, e.g. $\Pr\{Y_{ij} = 1\} = \gamma(Z_i, Z_j)$ with $\gamma(z, z')$ a function of the distance between $z$ and $z'$. (The slide also displays a toy adjacency matrix $Y$.)

A variety of state-space models. Continuous latent position models: [Hoff et al. (2002)]: $Z_i \in \mathbb{R}^d$, $\mathrm{logit}[\gamma(z, z')] = a - \|z - z'\|$. [Handcock et al. (2007)]: $Z_i \sim \sum_k p_k \, \mathcal{N}_d(\mu_k, \sigma_k^2 I)$. [Lovász and Szegedy (2006)]: $Z_i \sim \mathcal{U}_{[0,1]}$, $\gamma(z, z') : [0,1]^2 \to [0,1]$ =: graphon function. [Daudin et al. (2010)]: $Z_i \in \mathcal{S}_K$, $\gamma(z, z') = \sum_{k,l} z_k z_l \gamma_{kl}$.

Stochastic Block Model (SBM). A mixture model for random graphs [Nowicki and Snijders (2001)]. Consider $n$ nodes ($i = 1..n$); $Z_i$ = unobserved label of node $i$: $\{Z_i\}$ iid $\sim \mathcal{M}(1; \pi)$, $\pi = (\pi_1, \ldots, \pi_K)$; the edge $Y_{ij}$ depends on the labels: $\{Y_{ij}\}$ independent given $\{Z_i\}$, with $\Pr\{Y_{ij} = 1\} = \gamma(Z_i, Z_j)$.
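
To make the generative model concrete, here is a minimal simulation sketch in Python/NumPy (function names and parameter values are illustrative, not taken from the talk): draw the labels from a multinomial with proportions π, then draw each edge independently with probability γ indexed by the two labels.

```python
import numpy as np

def simulate_sbm(n, pi, gamma, rng=None):
    """Simulate an undirected binary SBM graph.

    n     : number of nodes
    pi    : (K,) group proportions
    gamma : (K, K) symmetric connection probabilities
    """
    rng = np.random.default_rng(rng)
    K = len(pi)
    # Latent labels: Z_i ~ M(1; pi), stored as integers in {0, ..., K-1}
    z = rng.choice(K, size=n, p=pi)
    # Edges: Y_ij ~ Bernoulli(gamma[z_i, z_j]), independent given Z
    y = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            y[i, j] = y[j, i] = rng.random() < gamma[z[i], z[j]]
    return z, y

# Illustrative example: two communities with strong within-group connectivity
pi = np.array([0.6, 0.4])
gamma = np.array([[0.8, 0.1],
                  [0.1, 0.7]])
z, y = simulate_sbm(50, pi, gamma, rng=0)
```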

Variational inference. Joint work with J.-J. Daudin, S. Gazal, F. Picard. [Daudin et al. (2008)], [Gazal et al. (2012)]

Incomplete data models. Aim: based on the observed network $Y = (Y_{ij})$, we want to infer the parameters $\theta = (\pi, \gamma)$ and the hidden states $Z = (Z_i)$. State-space models belong to the class of incomplete data models: the edges $(Y_{ij})$ are observed, while the latent positions (or labels) $(Z_i)$ are not, the usual issue in unsupervised classification.

Frequentist maximum likelihood inference. Likelihood: the log-likelihood $\log P(Y; \theta) = \log \sum_Z P(Y, Z; \theta)$ cannot be computed. EM trick: it can nonetheless be decomposed as $\log P(Y; \theta) = \log P(Y, Z; \theta) - \log P(Z \mid Y; \theta)$, the conditional expectation of which gives $\mathbb{E}[\log P(Y; \theta) \mid Y] = \mathbb{E}[\log P(Y, Z; \theta) \mid Y] - \mathbb{E}[\log P(Z \mid Y; \theta) \mid Y]$, i.e. $\log P(Y; \theta) = \mathbb{E}[\log P(Y, Z; \theta) \mid Y] + \mathcal{H}[P(Z \mid Y; \theta)]$, where $\mathcal{H}$ stands for the entropy.

EM algorithm. Aims at maximizing the log-likelihood $\log P(Y; \theta)$ through the alternation of two steps [Dempster et al. (1977)]. M-step: maximize $\mathbb{E}[\log P(Y, Z; \theta) \mid Y]$ with respect to $\theta$; generally similar to standard MLE. E-step: calculate $P(Z \mid Y)$ (at least, some of its moments); sometimes straightforward (independent mixture models: Bayes formula), sometimes tricky but doable (HMMs: forward-backward recursion), sometimes impossible (SBM: ...).
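
For instance, in the "straightforward" case of an independent mixture the E-step is just Bayes' formula applied to each observation. A minimal sketch for a univariate Gaussian mixture with unit variances (illustrative code, not part of the talk):

```python
import numpy as np
from scipy.stats import norm

def em_gaussian_mixture(x, K, n_iter=100, rng=None):
    """EM for a univariate Gaussian mixture with unit variances (sketch)."""
    rng = np.random.default_rng(rng)
    pi = np.full(K, 1.0 / K)                    # mixing proportions
    mu = rng.choice(x, size=K, replace=False)   # initial component means
    for _ in range(n_iter):
        # E-step: posterior membership probabilities (Bayes formula)
        tau = pi * norm.pdf(x[:, None], loc=mu, scale=1.0)
        tau /= tau.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        pi = tau.mean(axis=0)
        mu = (tau * x[:, None]).sum(axis=0) / tau.sum(axis=0)
    return pi, mu, tau

# Illustrative usage on a sample drawn from two well-separated components
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 300)])
pi_hat, mu_hat, tau = em_gaussian_mixture(x, K=2, rng=1)
```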

Conditional distribution of Z. EM works provided that we can calculate $P(Z \mid Y; \theta)$, but we cannot for state-space models for graphs. Conditional distribution: the dependency graph of $Z$ given $Y$ is a clique, so no factorization can be hoped for (unlike for HMMs), and $P(Z \mid Y; \theta)$ cannot be computed efficiently. Variational techniques may help, as they provide an approximation $Q(Z) \approx P(Z \mid Y)$.

Variational inference. Lower bound of the log-likelihood: for any distribution $Q(Z)$ [Jaakkola (2000), Wainwright and Jordan (2008)], $\log P(Y) \geq \log P(Y) - KL[Q(Z), P(Z \mid Y)] = \int Q(Z) \log P(Y, Z) \, dZ - \int Q(Z) \log Q(Z) \, dZ = \mathbb{E}_Q[\log P(Y, Z)] + \mathcal{H}[Q(Z)]$. Link with EM: this is similar to $\log P(Y) = \mathbb{E}[\log P(Y, Z) \mid Y] + \mathcal{H}[P(Z \mid Y)]$, replacing $P(Z \mid Y)$ with $Q(Z)$.
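
The inequality can be checked numerically on any toy model with a discrete latent variable: whatever $Q(Z)$ is, $\mathbb{E}_Q[\log P(Y, Z)] + \mathcal{H}[Q(Z)]$ never exceeds $\log P(Y)$, and the bound is tight at $Q(Z) = P(Z \mid Y)$. A small sketch with made-up numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy joint distribution P(Y = y_obs, Z = z) over 4 latent states (illustrative values)
p_yz = np.array([0.10, 0.05, 0.20, 0.05])
log_p_y = np.log(p_yz.sum())                 # log P(Y = y_obs)

# Any distribution Q over Z gives a lower bound (the ELBO)
q = rng.dirichlet(np.ones(4))
elbo = np.sum(q * np.log(p_yz)) - np.sum(q * np.log(q))   # E_Q[log P(Y,Z)] + H[Q]
assert elbo <= log_p_y + 1e-12

# The bound is tight when Q(Z) = P(Z | Y)
q_star = p_yz / p_yz.sum()
elbo_star = np.sum(q_star * np.log(p_yz)) - np.sum(q_star * np.log(q_star))
print(log_p_y, elbo, elbo_star)   # elbo_star equals log_p_y up to rounding
```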

Variational EM algorithm. M-step: compute $\hat{\theta} = \arg\max_\theta \mathbb{E}_Q[\log P(Y, Z; \theta)]$. E-step: replace the calculation of $P(Z \mid Y)$ with the search of $Q^*(Z) = \arg\min_{Q \in \mathcal{Q}} KL[Q(Z), P(Z \mid Y)]$. Taking $\mathcal{Q} = \{$all possible distributions$\}$ gives $Q^*(Z) = P(Z \mid Y)$, like EM does. Variational approximations rely on the choice of a set $\mathcal{Q}$ of good and tractable distributions.

Variational EM for SBM. Distribution class: $\mathcal{Q}$ = set of factorisable distributions, $\mathcal{Q} = \{Q : Q(Z) = \prod_i Q_i(Z_i)\}$ with $Q_i(Z_i) = \prod_k \tau_{ik}^{Z_{ik}}$. The approximate joint distribution is $Q(Z_i, Z_j) = Q_i(Z_i) Q_j(Z_j)$. The optimal approximation within this class satisfies a fixed-point relation, $\tau_{ik} \propto \pi_k \prod_{j \neq i} \prod_l f_{kl}(Y_{ij})^{\tau_{jl}}$, also known as the mean-field approximation in physics [Parisi (1988)]. Variational estimates: there is no general statistical guarantee for variational estimates; SBM is a very specific case for which the variational approximation is asymptotically exact [Celisse et al. (2012), Mariadassou and Matias (2012)].
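
A compact sketch of the resulting variational EM for a binary, undirected SBM (an illustrative implementation, not the authors' code): the E-step iterates the mean-field fixed point on the $\tau_{ik}$ on the log scale, and the M-step updates $\pi$ and $\gamma$ from the current $\tau$.

```python
import numpy as np

def vem_sbm(y, K, n_iter=50, n_fixed_point=10, rng=None):
    """Variational EM for a binary undirected SBM (mean-field sketch)."""
    rng = np.random.default_rng(rng)
    n = y.shape[0]
    eps = 1e-10
    off = (~np.eye(n, dtype=bool)).astype(float)      # mask excluding self-pairs
    tau = rng.dirichlet(np.ones(K), size=n)           # variational probabilities tau_ik
    for _ in range(n_iter):
        # M-step: weighted estimates of pi and gamma
        pi = tau.mean(axis=0)
        edges = tau.T @ (y * off) @ tau               # expected edge counts between groups
        pairs = tau.T @ off @ tau                     # expected pair counts between groups
        gamma = np.clip(edges / np.maximum(pairs, eps), eps, 1 - eps)
        # E-step: mean-field fixed point  tau_ik ∝ pi_k prod_{j≠i} prod_l f_kl(Y_ij)^tau_jl
        lg1, lg0 = np.log(gamma), np.log(1 - gamma)
        for _ in range(n_fixed_point):
            log_tau = (np.log(np.maximum(pi, eps))
                       + (y * off) @ tau @ lg1.T
                       + ((1 - y) * off) @ tau @ lg0.T)
            log_tau -= log_tau.max(axis=1, keepdims=True)
            tau = np.exp(log_tau)
            tau /= tau.sum(axis=1, keepdims=True)
    return tau, pi, gamma

# Illustrative usage with the earlier simulation sketch:
# tau, pi_hat, gamma_hat = vem_sbm(y, K=2); labels = tau.argmax(axis=1)
```

In practice several random restarts are used, since the variational criterion is multimodal.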

Variational Bayes inference. Bayesian perspective: $\theta = (\pi, \gamma)$ is random. Model: $P(\theta)$, $P(Z \mid \pi)$, $P(Y \mid \gamma, Z)$. Bayesian inference: the aim is then to get the joint conditional distribution of the parameters and of the hidden variables, $P(\theta, Z \mid Y)$.

Variational Bayes algorithm. Variational Bayes: as $P(\theta, Z \mid Y)$ is intractable, one looks for $Q^*(\theta, Z) = \arg\min_{Q \in \mathcal{Q}} KL[Q(\theta, Z), P(\theta, Z \mid Y)]$, taking, e.g., $\mathcal{Q} = \{Q(\theta, Z) = Q_\theta(\theta) Q_Z(Z)\}$. Variational Bayes EM (VBEM): when $P(Z, Y \mid \theta)$ belongs to the exponential family and $P(\theta)$ is the corresponding conjugate prior, $Q^*$ can be obtained iteratively as [Beal and Ghahramani (2003)] $\log Q_\theta^h(\theta) = \mathbb{E}_{Q_Z^{h-1}}[\log P(Z, Y, \theta)] + \mathrm{cst}$, $\log Q_Z^h(Z) = \mathbb{E}_{Q_\theta^h}[\log P(Z, Y, \theta)] + \mathrm{cst}$. Application to SBM: [Latouche et al. (2012)].
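
Under conjugate Dirichlet (for $\pi$) and Beta (for each $\gamma_{kl}$) priors, the two updates above are closed-form hyperparameter updates. The sketch below illustrates the general recipe for a binary undirected SBM; it follows the Beal and Ghahramani scheme but is not the exact bookkeeping of [Latouche et al. (2012)] (in particular, the handling of symmetric dyads is simplified).

```python
import numpy as np
from scipy.special import digamma

def vbem_sbm_step(y, tau, alpha0=1.0, a0=1.0, b0=1.0):
    """One VBEM sweep for a binary undirected SBM with conjugate priors (sketch).

    Q_theta update: Dirichlet(alpha) for pi, Beta(a_kl, b_kl) for gamma_kl.
    Q_Z update: mean-field refresh of tau using E_Q[log pi] and E_Q[log gamma].
    """
    n, K = tau.shape
    off = (~np.eye(n, dtype=bool)).astype(float)
    # --- update Q_theta given Q_Z ---
    alpha = alpha0 + tau.sum(axis=0)                 # Dirichlet hyperparameters
    a = a0 + 0.5 * (tau.T @ (y * off) @ tau)         # Beta "edge" pseudo-counts
    b = b0 + 0.5 * (tau.T @ ((1 - y) * off) @ tau)   # Beta "non-edge" pseudo-counts
    # --- update Q_Z given Q_theta ---
    e_log_pi = digamma(alpha) - digamma(alpha.sum())
    e_log_g1 = digamma(a) - digamma(a + b)           # E[log gamma_kl]
    e_log_g0 = digamma(b) - digamma(a + b)           # E[log (1 - gamma_kl)]
    log_tau = e_log_pi + (y * off) @ tau @ e_log_g1.T + ((1 - y) * off) @ tau @ e_log_g0.T
    log_tau -= log_tau.max(axis=1, keepdims=True)
    tau = np.exp(log_tau)
    tau /= tau.sum(axis=1, keepdims=True)
    return tau, alpha, a, b
```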

VBEM: Simulation study. (Figures: width of the posterior credibility intervals for $\pi_1$, $\gamma_{11}$, $\gamma_{12}$, $\gamma_{22}$, with a distinct plotting symbol for each parameter.)

SBM analysis of E. coli operon networks [Picard et al. (2009)]. (Figures: meta-graph representation and parameter estimates, $K = 5$.)

(Variational) Bayesian model averaging. Joint work with M.-L. Martin-Magniette, S. Volant. [Volant et al. (2012)]

Model choice. Model selection: the number of classes $K$ generally needs to be estimated. In the frequentist setting, an approximate ICL criterion can be derived: $ICL = \mathbb{E}[\log P(Y, Z) \mid Y] - \frac{1}{2}\left\{ \frac{K(K+1)}{2} \log \frac{n(n-1)}{2} + (K-1) \log n \right\}$. In the Bayesian setting, exact versions of the BIC and ICL criteria can be calculated as $\log P(Y, K)$ and $\log P(Y, Z, K)$ (up to the variational approximation) [Latouche et al. (2012)]. But in some applications model selection may be useless or meaningless, and model averaging may be preferred.
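
The frequentist penalty above is straightforward to compute; a small illustrative helper (the completed-likelihood term would come from the variational EM output):

```python
import numpy as np

def icl_sbm(expected_complete_loglik, n, K):
    """Approximate ICL criterion for a binary SBM with K groups (sketch).

    expected_complete_loglik : E[log P(Y, Z) | Y], e.g. evaluated at the VEM solution.
    """
    penalty = 0.5 * (K * (K + 1) / 2 * np.log(n * (n - 1) / 2) + (K - 1) * np.log(n))
    return expected_complete_loglik - penalty
```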

Bayesian model averaging (BMA). General principle [Hoeting et al. (1999)]: let $\Delta$ denote a parameter that can be defined under a series of different models $\{M_K\}_K$, and let $P_K(\Delta \mid Y)$ be its posterior distribution under model $M_K$. The posterior distribution of $\Delta$ can then be averaged over all models as $P(\Delta \mid Y) = \sum_K P(M_K \mid Y) P_K(\Delta \mid Y)$. Remarks: $w_K = P(M_K \mid Y)$ is the weight given to model $M_K$ for the estimation of $\Delta$; calculating $w_K$ is not easy, but variational approximation may help.

Variational Bayesian model averaging (VBMA). Variational Bayes formulation: $M_K$ can be viewed as one more hidden layer. Variational Bayes then aims at finding $Q^*(K, \theta, Z) = \arg\min_{Q \in \mathcal{Q}} KL[Q(K, \theta, Z), P(K, \theta, Z \mid Y)]$ with $\mathcal{Q} = \{Q(\theta, Z, K) = Q_\theta(\theta \mid K) Q_Z(Z \mid K) Q_K(K)\}$ (no additional approximation w.r.t. the regular VBEM). Optimal variational weights: $Q_K^*(K) \propto P(K) \exp\{\log P(Y \mid K) - KL[Q^*(Z, \theta \mid K); P(Z, \theta \mid Y, K)]\} = P(K \mid Y) \, e^{-KL[Q^*(Z, \theta \mid K); P(Z, \theta \mid Y, K)]}$.

VBMA: the recipe. For $K = 1, \ldots, K_{\max}$: use regular VBEM to compute the approximate conditional posterior $Q_{Z, \theta \mid K}(Z, \theta \mid K) = Q_{Z \mid K}(Z \mid K) Q_{\theta \mid K}(\theta \mid K)$; compute the conditional lower bound of the log-likelihood $\mathcal{L}_K = \log P(Y \mid K) - KL[Q^*(Z, \theta \mid K); P(Z, \theta \mid Y, K)]$; compute the approximate posterior of model $M_K$: $w_K := Q_K^*(K) \propto P(K) \exp(\mathcal{L}_K)$. Then deduce the approximate posterior of $\Delta$: $P(\Delta \mid Y) \approx \sum_K w_K Q_{\theta \mid K}(\Delta \mid K)$.
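
Numerically, the weights are best computed from the lower bounds on the log scale, e.g. with the small illustrative helper below (`lower_bounds` would hold the $\mathcal{L}_K$ values returned by the VBEM runs; the default prior on $K$ is uniform).

```python
import numpy as np

def vbma_weights(lower_bounds, log_prior=None):
    """Approximate posterior model weights w_K ∝ P(K) exp(L_K), via log-sum-exp."""
    lb = np.asarray(lower_bounds, dtype=float)
    log_w = lb if log_prior is None else lb + np.asarray(log_prior, dtype=float)
    log_w -= log_w.max()            # stabilise before exponentiating
    w = np.exp(log_w)
    return w / w.sum()

# e.g. lower bounds L_K for K = 1..5 returned by VBEM (illustrative values)
w = vbma_weights([-1502.3, -1410.8, -1395.1, -1396.4, -1401.0])
```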

Towards W-graphs. Joint work with P. Latouche. [Latouche and Robin (2013)]

W-graph model. Latent variables: $(Z_i)$ iid $\sim \mathcal{U}_{[0,1]}$. Graphon function $\gamma$: $\gamma(z, z') : [0,1]^2 \to [0,1]$. Edges: $\Pr\{Y_{ij} = 1\} = \gamma(Z_i, Z_j)$.

Inference of the graphon function. Probabilistic point of view: W-graphs have mostly been studied in the probability literature [Lovász and Szegedy (2006)], [Diaconis and Janson (2008)]. Motif (sub-graph) frequencies are invariant characteristics of a W-graph. The intrinsic unidentifiability of the graphon function $\gamma$ is often overcome by imposing that $u \mapsto \int \gamma(u, v) \, dv$ is monotone increasing. Statistical point of view: not much attention was paid to its inference until very recently [Chatterjee (2012)], [Airoldi et al. (2013)], [Wolfe and Olhede (2013)]; the latter two also use SBM as a proxy for the W-graph.

SBM as a W-graph model. Latent variables: $(U_i)$ iid $\sim \mathcal{U}_{[0,1]}$, $Z_{ik} = \mathbb{I}\{\sigma_{k-1} \leq U_i < \sigma_k\}$ where $\sigma_k = \sum_{l=1}^k \pi_l$. Blockwise constant graphon: $\gamma_K^{SBM}(z, z') = \gamma_{kl}$ for $z \in [\sigma_{k-1}, \sigma_k)$ and $z' \in [\sigma_{l-1}, \sigma_l)$. Edges: $\Pr\{Y_{ij} = 1\} = \gamma(Z_i, Z_j)$.

Variational Bayes estimation of $\gamma(z, z')$. Posterior mean of $\gamma_K^{SBM}(z, z')$: VBEM inference provides the approximate posteriors $(\pi \mid Y) \approx$ Dirichlet and $(\gamma_{kl} \mid Y) \approx$ Beta, each with updated hyperparameters. Estimate of $\gamma(u, v)$: $\hat{\gamma}_K^{SBM}(u, v) = \tilde{\mathbb{E}}\left(\gamma_{C(u), C(v)} \mid Y\right)$ where $C(u) = 1 + \sum_k \mathbb{I}\{\sigma_k \leq u\}$. [Gouda and Szántai (2010)]

Model averaging. There is no true $K$ in the W-graph model, so we apply the VBMA recipe: for $K = 1, \ldots, K_{\max}$, fit an SBM via VBEM and compute $\hat{\gamma}_K^{SBM}(z, z') = \mathbb{E}_{Q^*}[\gamma_{C(z), C(z')}]$, then perform model averaging as $\hat{\gamma}(z, z') = \sum_K w_K \, \hat{\gamma}_K^{SBM}(z, z')$, where the $w_K$ are the variational weights arising from variational Bayes inference.
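
Putting the last two slides together, here is a sketch of how the averaged graphon estimate could be evaluated on a grid from the fitted SBMs. The names are illustrative: `fits` is assumed to contain, for each K, the estimated proportions, the (approximate) posterior means of the γ_kl, and the variational weight w_K.

```python
import numpy as np

def sbm_graphon(u, v, pi_hat, gamma_hat):
    """Blockwise-constant graphon estimate gamma_K^SBM(u, v) of one fitted SBM."""
    sigma = np.cumsum(pi_hat)                      # block boundaries sigma_1..sigma_K
    c_u = np.searchsorted(sigma, u, side="right")  # C(u): block containing u
    c_v = np.searchsorted(sigma, v, side="right")
    c_u = np.minimum(c_u, len(pi_hat) - 1)         # guard against u == 1
    c_v = np.minimum(c_v, len(pi_hat) - 1)
    return gamma_hat[np.ix_(c_u, c_v)]

def averaged_graphon(grid, fits):
    """Model-averaged graphon estimate sum_K w_K * gamma_K^SBM(u, v) on a grid."""
    est = np.zeros((len(grid), len(grid)))
    for pi_hat, gamma_hat, w in fits:
        est += w * sbm_graphon(grid, grid, pi_hat, gamma_hat)
    return est

grid = np.linspace(0, 1, 101)   # evaluation grid on [0, 1]
```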

Some simulations. Design: symmetric graphon $\gamma(u, v) = \rho \lambda^2 (uv)^{\lambda - 1}$; we examine the variational posterior for $K$, $Q^*(K)$. Larger $\lambda$ gives a more imbalanced graph; larger $\rho$ gives a denser graph. Results: more complex models are selected as $n$ and $\lambda$ increase, and the posterior is fairly concentrated.

French political blogosphere. Website network of French political blogs: 196 nodes, 1432 edges.

French political blogosphere. Inferred graphon: $\hat{W}(u, v) = \mathbb{E}(\gamma(u, v) \mid Y)$. Motif probabilities can be estimated as well: $\hat{\mu}(m) = \mathbb{E}(\mu(m) \mid Y)$.

Conclusion. Real networks are heterogeneous. State-space models (including SBM) allow one to capture such heterogeneity, but raise inference problems. Variational approximations help to deal with complex (conditional) dependency structures. SBM is a special (rare?) case for which guarantees exist regarding the performance of the variational approximation. SBM can be used as a proxy for smoother models, such as W-graphs.

Airoldi, E. M., Costa, T. B. and Chan, S. H. (2013). Stochastic blockmodel approximation of a graphon: theory and consistent estimation. ArXiv e-prints, 1311.1731.
Beal, M. J. and Ghahramani, Z. (2003). The variational Bayesian EM algorithm for incomplete data: with application to scoring graphical model structures. Bayes. Statist. 7, 543–52.
Bollobás, B., Janson, S. and Riordan, O. (2007). The phase transition in inhomogeneous random graphs. Rand. Struct. Algo. 31 (1), 3–122.
Celisse, A., Daudin, J.-J. and Pierre, L. (2012). Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electron. J. Statis. 6, 1847–99.
Chatterjee, S. (2012). Matrix estimation by Universal Singular Value Thresholding. ArXiv e-prints, 1212.1247.
Daudin, J.-J., Picard, F. and Robin, S. (2008). A mixture model for random graphs. Stat. Comput. 18 (2), 173–83.
Daudin, J.-J., Pierre, L. and Vacher, C. (2010). Model for heterogeneous random networks using continuous latent variables and an application to a tree fungus network. Biometrics 66 (4), 1043–1051.
Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. R. Statist. Soc. B 39, 1–38.
Diaconis, P. and Janson, S. (2008). Graph limits and exchangeable random graphs. Rend. Mat. Appl. 7 (28), 33–61.
Gazal, S., Daudin, J.-J. and Robin, S. (2012). Accuracy of variational estimates for random graph mixture models. Journal of Statistical Computation and Simulation 82 (6), 849–862.
Girvan, M. and Newman, M. E. J. (2002). Community structure in social and biological networks. Proc. Natl. Acad. Sci. USA 99 (12), 7821–6.
Gouda, A. and Szántai, T. (2010). On numerical calculation of probabilities according to Dirichlet distribution. Ann. Oper. Res. 177, 185–200. DOI: 10.1007/s10479-009-0601-9.

Handcock, M., Raftery, A. and Tantrum, J. (2007). Model-based clustering for social networks. JRSSA 170 (2), 301–54. doi: 10.1111/j.1467-985X.2007.00471.x.
Hoeting, J. A., Madigan, D., Raftery, A. E. and Volinsky, C. T. (1999). Bayesian model averaging: a tutorial. Statistical Science 14 (4), 382–417.
Hoff, P. D., Raftery, A. E. and Handcock, M. S. (2002). Latent space approaches to social network analysis. J. Amer. Statist. Assoc. 97 (460), 1090–98.
Jaakkola, T. (2000). Tutorial on variational approximation methods. In Advanced Mean Field Methods: Theory and Practice. MIT Press.
Latouche, P., Birmelé, E. and Ambroise, C. (2012). Variational Bayesian inference and complexity control for stochastic block models. Statis. Model. 12 (1), 93–115.
Latouche, P. and Robin, S. (2013). Bayesian model averaging of stochastic block models to estimate the graphon function and motif frequencies in a W-graph model. ArXiv e-prints, 1310.6150.
Lovász, L. and Szegedy, B. (2006). Limits of dense graph sequences. Journal of Combinatorial Theory, Series B 96 (6), 933–957.
von Luxburg, U., Belkin, M. and Bousquet, O. (2008). Consistency of spectral clustering. Ann. Stat. 36 (2), 555–586.
Mariadassou, M. and Matias, C. (2012). Convergence of the groups posterior distribution in latent or stochastic block models. ArXiv e-prints, 1206.7101.
Newman, M. and Girvan, M. (2004). Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113.
Newman, M. E. J. (2004). Fast algorithm for detecting community structure in networks. Phys. Rev. E 69, 066133.
Nowicki, K. and Snijders, T. (2001). Estimation and prediction for stochastic block-structures. J. Amer. Statist. Assoc. 96, 1077–87.
Parisi, G. (1988). Statistical Field Theory. Addison Wesley, New York.

Picard, F., Miele, V., Daudin, J.-J., Cottret, L. and Robin, S. (2009). Deciphering the connectivity structure of biological networks using MixNet. BMC Bioinformatics 10 (Suppl 6), S17. doi:10.1186/1471-2105-10-S6-S17.
Volant, S., Martin-Magniette, M.-L. and Robin, S. (2012). Variational Bayes approach for model aggregation in unsupervised classification with Markovian dependency. Comput. Statis. & Data Analysis 56 (8), 2375–2387.
Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn. 1 (1–2), 1–305. http://dx.doi.org/10.1561/2200000001.
Wolfe, P. J. and Olhede, S. C. (2013). Nonparametric graphon estimation. ArXiv e-prints, 1309.5936.

SBM for a binary social network. Zachary data: binary social network of friendships within a sports club. Estimated connection probabilities γ_kl (%) and group proportions π_l (%):

k/l     1     2     3     4
1     100    53    16    16
2       -    12     0     7
3       -     -     8    73
4       -     -     -   100
π_l     9    38    47     6

Kullback-Leibler divergence. Definition: $KL[Q(\cdot), P(\cdot)] = \int Q(Z) \log \frac{Q(Z)}{P(Z)} \, dZ = \int Q(Z) \log Q(Z) \, dZ - \int Q(Z) \log P(Z) \, dZ$. Some properties: always non-negative; null iff $P = Q$; the contrast is consistent with maximum-likelihood inference; not a distance, only a divergence.
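
A short numerical illustration of these properties for discrete distributions (values made up):

```python
import numpy as np

def kl(q, p):
    """KL divergence KL[Q, P] = sum_z Q(z) log(Q(z) / P(z)) for discrete distributions."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    return float(np.sum(q * np.log(q / p)))

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(kl(q, p) >= 0, kl(q, q) == 0)   # always non-negative; zero iff P = Q
print(kl(q, p), kl(p, q))             # not symmetric: a divergence, not a distance
```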

Concentration of the degree distribution. (Figures for $n = 100$, $n = 1000$ and $n = 10000$.)