Algorithmic approaches to fitting ERG models


Ruth Hummel, Penn State University; Mark Handcock, University of Washington; David Hunter, Penn State University. Research funded by Office of Naval Research Award No. N00014-08-1-1015. MURI meeting, April 24, 2009.


The class of Exponential-family Random Graph Models (ERGMs)

Definition:
P_η(Y = y) = exp{η^t g(y)} / κ(η),
where
- Y is a random network written as an adjacency matrix, so that Y_ij is the indicator of an edge from i to j;
- g(y) is a vector of the network statistics of interest;
- η is a vector of parameters corresponding to the vector g(y);
- κ(η) is the constant of proportionality that makes the probabilities sum to one; it is intractable.
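As a concrete illustration (an assumption on my part: the ergm package for R and its bundled florentine data, which are not part of this talk), the observed statistic vector g(y_obs) for a small model can be computed directly:

library(ergm)
data(florentine)                          # loads flomarriage, a small 16-node marriage network
# g(y_obs): the observed values of the chosen network statistics
summary(flomarriage ~ edges + triangle)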

Loglikelihood

The loglikelihood for this class of models is
l(η) = η^t g(y_obs) - log Σ_{z ∈ Y} exp{η^t g(z)},   (1)
which can be written
l(η) - l(η_0) = (η - η_0)^t g(y_obs) - log E_η0 [ exp{(η - η_0)^t g(Y)} ].   (2)

Maximum Pseudolikelihood Estimation

Notation: for a network y and a pair (i, j) of nodes,
- y_ij = 0 or 1, depending on whether there is an edge;
- y^c_ij denotes the status of all pairs in y other than (i, j);
- y^+_ij denotes the same network as y but with y_ij = 1;
- y^-_ij denotes the same network as y but with y_ij = 0.

Assume no dependence among the Y_ij; in other words, assume P(Y_ij = 1) = P(Y_ij = 1 | Y^c_ij = y^c_ij). Then some algebra gives
log [ P(Y_ij = 1) / P(Y_ij = 0) ] = θ^t [ g(y^+_ij) - g(y^-_ij) ],
so θ is estimated by straightforward logistic regression. The result is the maximum pseudolikelihood estimate.

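A minimal sketch of the MPLE as logistic regression on change statistics, again using the ergm package and the florentine data as illustrative assumptions (ergmMPLE() extracts the response, change statistics, and weights; its exact interface may differ across ergm versions):

library(ergm)
data(florentine)
# MPLE = logistic regression of y_ij on the change statistics g(y+_ij) - g(y-_ij)
mple.data <- ergmMPLE(flomarriage ~ edges + triangle)
fit <- glm(mple.data$response ~ mple.data$predictor - 1,
           weights = mple.data$weights, family = binomial)
coef(fit)
# Equivalently: ergm(flomarriage ~ edges + triangle, estimate = "MPLE")
# (older ergm versions, as on the E. coli slide below, used MPLEonly = TRUE).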

MPLE's behavior

MLE (maximum likelihood estimation): a well-established method, but very hard because the normalizing constant κ(η) is difficult (usually impossible) to evaluate, so we approximate it instead.

MPLE (maximum pseudo-likelihood estimation): easy to do using logistic regression, but based on an independence assumption that is often not justified. Several authors, notably van Duijn et al. (2009), argue forcefully against the use of MPLE.

Back to the loglikelihood

Remember that the loglikelihood
l(η) = η^t g(y_obs) - log Σ_{z ∈ Y} exp{η^t g(z)}   (3)
can be written
l(η) - l(η_0) = (η - η_0)^t g(y_obs) - log E_η0 [ exp{(η - η_0)^t g(Y)} ].   (4)
This leads to our first approximation for the MLE.

MCMC MLE idea: pick η_0 (for example, the MPLE), draw Y_1, ..., Y_m from the model at η_0 using MCMC, then approximate the expectation above by a sample mean.


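A minimal base-R sketch of this naive MCMC MLE approximation, assuming gsim is an m x p matrix of statistics g(Y_1), ..., g(Y_m) simulated from the model at eta0 (for example via simulate() in the ergm package) and gobs is the observed statistic vector g(y_obs):

# Sample-mean approximation of l(eta) - l(eta0) in equation (2)
approx.llr <- function(eta, eta0, gobs, gsim) {
  d <- eta - eta0
  sum(d * gobs) - log(mean(exp(gsim %*% d)))
}
# The MCMC MLE maximizes this over eta, e.g.
# eta.hat <- optim(eta0, function(e) -approx.llr(e, eta0, gobs, gsim))$par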

Existence of the MLE?

There are two issues we need to be concerned with when using MCMC MLE. First, if the observed g(y_obs) is not in the interior of the convex hull of the sampled (MCMC-generated) g(Y_i), then no maximizer of the approximate loglikelihood ratio exists. Also, the approximation of the likelihood surface,
l(η) - l(η_0) ≈ (η - η_0)^t g(y_obs) - log [ (1/m) Σ_{i=1}^m exp{(η - η_0)^t g(Y_i)} ],   (5)
is not very good when we get far from η_0.

Quick Erdős-Rényi illustration

The Erdős-Rényi model can be written as an ERGM if g(y) is the number of edges of y. Since each edge exists independently with probability b,
P(Y = y) = b^{g(y)} (1 - b)^{N - g(y)} = (b / (1 - b))^{g(y)} (1 - b)^N,
where N is the number of possible edges, so the natural parameter is η = log(b / (1 - b)).
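The Erdős-Rényi case can be checked numerically. The following sketch (with illustrative numbers chosen by me, not taken from the talk) compares the exact loglikelihood ratio with the naive sample-mean estimate; no MCMC is needed here because the edge count is exactly binomial:

n <- 30; N <- choose(n, 2)                # nodes and possible edges (illustrative)
b <- 1/3                                  # true edge probability, eta = log(b/(1-b)) ~ -0.69
b0 <- 0.354                               # edge probability at the starting value eta0 ~ -0.6
eta0 <- log(b0 / (1 - b0))
gobs <- rbinom(1, N, b)                   # an "observed" edge count
gsim <- rbinom(1000, N, b0)               # edge counts simulated at eta0
p.grid <- seq(0.05, 0.95, by = 0.01)
eta.grid <- log(p.grid / (1 - p.grid))
true.llr <- (eta.grid - eta0) * gobs -
  N * (log(1 + exp(eta.grid)) - log(1 + exp(eta0)))
est.llr <- sapply(eta.grid, function(e) {
  x <- (e - eta0) * gsim
  (e - eta0) * gobs - (max(x) + log(mean(exp(x - max(x)))))  # stable log-mean-exp
})
# Plotting true.llr and est.llr against p.grid shows the estimate is reliable
# only near the binomial probability corresponding to eta0.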

[Figure: true and estimated loglikelihood ratios for an Erdős-Rényi graph with true η = -0.69 (binomial probability 1/3), plotted against the binomial probability. Left panel: starting value η_0 = -0.6 (probability 0.354); right panel: starting value η_0 = -1 (probability 0.269). In both panels the MLE is -0.55 (probability 0.365); the estimated curve is accurate only near η_0.]


Mean value parametrization

1. We would like to move our arbitrary initial value η_0 close enough to the MLE that samples (of the statistics) generated from η_0 cover the observed statistics g(y_obs).
2. New idea: intuitively, then, we might try to find a way to make the journey from η_0 to the MLE in pieces. In order to take steps toward the MLE, we need to have some idea where we are going. We obviously don't know where the MLE is, but we do know that it has the following property.

Definition: The MLE, if it exists, is the unique parameter vector η̂ satisfying E_η̂ g(Y) = g(y_obs) = ξ̂. This is true for any exponential family (Barndorff-Nielsen, 1978; Brown, 1986).



[Diagram: the natural parameter space Ω and the mean value parameter space Ξ, with η_0 and the unknown MLE η̂ in Ω, and ξ_0 and ξ̂ = g(y_obs) in Ξ.]
- Ω is the original η parameter space (defined for the change statistics);
- Ξ is the corresponding mean value parameter space;
- η_0 is the (given) initial value of η in the Markov chain;
- ξ_0 is the vector of mean statistics corresponding to η_0;
- η̂ is the (unknown) MLE;
- ξ̂ is the observed statistic vector g(y_obs), which is the mean value parameter corresponding to η̂.

Taking steps

We now want to move toward ξ̂ = g(y_obs) in the mean value parameter space. Here we specify a step length 0 ≤ γ ≤ 1 as the fraction of the distance toward ξ̂ that we want to traverse. We move the fraction γ toward ξ̂ and call this point ξ_1 = γ ξ̂ + (1 - γ) ξ̂_0. At each step, we choose this fraction to be the biggest move toward g(y_obs) that does not leave the convex hull of the current sample.


Finding the MCMC MLE

Next, using the sample from the model defined by η_0, we maximize the approximate loglikelihood ratio
l(η) - l(η_0) ≈ (η - η_0)^t g(y_obs) - log [ (1/m) Σ_{i=1}^m exp{(η - η_0)^t g(Y_i)} ],   (6)
but with ξ_1 substituted in place of g(y_obs). The resulting maximizer is called η_1. The process then repeats, with η_1 taking the place of η_0.

Stepping in mean value parameter space

0. Set η_0.
1. Take an MCMC sample from the model defined by η = η_0 to get ξ̂_0.
2. Go a fraction γ of the way toward ξ̂ = g(y_obs) in mean value parameter space. Call this point ξ_1.
3. Use η_0 in MCMC MLE to find the η that corresponds to ξ_1. Call this η_1.
4. Re-estimate ξ_1 from an MCMC sample from the model defined by η = η_1. Call this ξ̂_1.
5. Repeat step 2 by going a fraction γ toward ξ̂ = g(y_obs) from ξ̂_1. Call this ξ_2. Also repeat steps 3 and 4 to obtain ξ̂_2. Keep going until g(y_obs) is in the convex hull of the new sample.
6. Use MCMC MLE with η_i as the initial value to find η̂. If we have made it into the appropriate neighborhood around η̂, this will now be possible.

[Diagrams on these slides trace steps 1 through 6 between Ω and Ξ.]

In review

In other words, for each t ≥ 0, we first use MCMC to draw a random sample Y_1, ..., Y_m from the model determined by η_t; then we set
ξ̂_t = (1/m) Σ_{i=1}^m g(Y_i);
ξ_{t+1} = γ_t ξ̂ + (1 - γ_t) ξ̂_t;
η_{t+1} = arg max_η { (η - η_t)^t ξ_{t+1} - log [ (1/m) Σ_{i=1}^m exp{(η - η_t)^t g(Y_i)} ] }.
We iterate until ξ̂ = g(y_obs) is in the convex hull of the statistics generated from η_t.
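A compact sketch of this stepping iteration, with two hypothetical helpers that are not ergm functions: simulate.stats(eta, m), which would return an m x p matrix of statistics simulated from the model at eta, and step.gamma(gobs, gsim), which would return the largest fraction γ keeping the target inside the convex hull of the sample (in practice found by solving a small linear program):

step.mcmcmle <- function(eta0, gobs, m = 1000, max.iter = 25) {
  eta <- eta0
  for (t in 1:max.iter) {
    gsim <- simulate.stats(eta, m)              # hypothetical: m x p matrix of g(Y_i)
    xihat <- colMeans(gsim)                     # estimated mean value parameter at eta
    gamma <- step.gamma(gobs, gsim)             # hypothetical: largest feasible step
    if (gamma >= 1) break                       # g(y_obs) is inside the convex hull
    xi.target <- gamma * gobs + (1 - gamma) * xihat
    # MCMC MLE step with xi.target playing the role of g(y_obs)
    nll <- function(e) -(sum((e - eta) * xi.target) -
                           log(mean(exp(gsim %*% (e - eta)))))
    eta <- optim(eta, nll)$par
  }
  eta                                           # starting value for a final, larger MCMC MLE run
}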

A second approximation: the lognormal

Here the ERGM is very simple: g(y) = number of edges; in other words, the model is binomial. Nevertheless, naive MCMC approximation of the loglikelihood ratio is not good far from θ_0, even for gigantic samples. One possible remedy: assume (θ - θ_0)^t g(Y) is normally distributed in
l(θ) - l(θ_0) = (θ - θ_0)^t g(y_obs) - log E_θ0 [ exp{(θ - θ_0)^t g(Y)} ].
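A hedged sketch of the lognormal idea in base R: if X = (θ - θ_0)^t g(Y) is approximately normal, then log E[exp(X)] ≈ mean(X) + var(X)/2, so the troublesome log of a sample mean of exponentials is replaced by two stable sample moments (gsim and gobs as in the earlier sketch):

approx.llr.lognormal <- function(theta, theta0, gobs, gsim) {
  x <- as.vector(gsim %*% (theta - theta0))     # simulated values of (theta - theta0)' g(Y)
  sum((theta - theta0) * gobs) - (mean(x) + var(x) / 2)
}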


Example Bionet: E. coli (Salgado et al., 2001)

A node is an operon. An edge A → B means that A encodes a transcription factor that regulates B. Green indicates self-regulation (in the network figure, not shown here).

Another ERGM for the E. coli network

We fit a model similar to that of Saul and Filkov (2007):

Term(s)             Description
Edges               Number of edges
2-Deg, ..., 5-Deg   Nodes with degree 2, ..., 5
GWDeg               Single statistic: weighted sum of 1-Deg, ..., (n-1)-Deg with weights tending to 1 at a geometric rate

model <- ergm(ecoli2 ~ edges + degree(2:5) + gwdegree(0.25, fixed=TRUE), MPLEonly=TRUE)

MPLE:
  edges  degree2  degree3  degree4  degree5  gwdegree
  -5.35    -2.58    -3.06    -2.39    -1.85      8.13

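A sketch of how such a degeneracy check can be run, assuming `model` is the MPLE fit above and that simulate() for ergm objects and network.edgecount() from the network package behave as in current releases (version details are an assumption on my part):

sims <- simulate(model, nsim = 25)               # 25 networks drawn from the MPLE-fitted model
edge.counts <- sapply(sims, network.edgecount)   # edge count of each simulated network
plot(edge.counts, type = "b")                    # compare with the observed count, cf. next slide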

MPLE fit is degenerate

With the MPLE we encounter problems. [Figure: time-series plot of the edge counts of the 25 networks generated from the MPLE, together with reference lines for the maximum possible and the observed edge counts.]

A sample from the MPLE-fitted model

Iteration 1: Yellow = 0.06(Red) + 0.94(Green). [Figure: scatterplot matrix of the sampled statistics (edges, degree2, ..., degree5, gwdegree). Red: g(y_obs); Green: sample mean.]

Theory: no maximizer of the approximated likelihood exists because g(y_obs) is not in the interior of the convex hull of the sampled points. However, the likelihood depends on the data only through g(y_obs). Idea: what if we pretend g(y_obs) is the yellow point?


A sample from a model with a better θ_0

Iteration 2: Yellow = 0.08(Red) + 0.92(Green). For θ_0 we have replaced the MPLE by the MCMC MLE obtained by pretending that g(y_obs) was the yellow point on the previous slide. [Figure: scatterplot matrix of the sampled statistics.]

Iteration 3: Yellow = 0.18(Red) + 0.82(Green). Continue to iterate...

Iteration 4: Yellow = 0.36(Red) + 0.64(Green).

Iteration 5: Yellow = 0.53(Red) + 0.47(Green).

Iteration 6: Yellow = 1(Red) + 0(Green). Finally, we don't need to pretend; the true g(y_obs) is actually interior to the convex hull of the sampled points...

Final iteration (#7): Green = sample mean, Yellow = observed statistics... so now we can take a larger sample and get a better final estimate of θ.

Finally, an MLE

Original MPLE:
  edges  degree2  degree3  degree4  degree5  gwdegree
  -5.35    -2.58    -3.06    -2.39    -1.85      8.13

Final (approximated) MLE:
  edges  degree2  degree3  degree4  degree5  gwdegree
  -5.06    -1.45    -2.35    -2.28    -2.91      1.77

We know this is close to the true MLE θ̂ because the true MLE uniquely satisfies E_θ̂ g(Y) = g(y_obs).
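This check can itself be done by simulation. A sketch, assuming `fit` is the converged MCMC MLE fit of the E. coli model and that simulate(..., output = "stats") returns simulated statistic vectors as in current ergm releases (older versions used statsonly = TRUE):

sim.stats <- simulate(fit, nsim = 100, output = "stats")
rbind(observed  = summary(ecoli2 ~ edges + degree(2:5) + gwdegree(0.25, fixed = TRUE)),
      simulated = colMeans(sim.stats))
# At the (approximate) MLE the two rows should agree up to Monte Carlo error.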

Failure of the first approximation in this example

Iteration 5: Yellow = 0.13(Red) + 0.87(Green). [Figure: scatterplot matrix of the sampled statistics.] A step length of 0.01 is too small in this case.

Conclusions

- MPLE looks increasingly dangerous; it can mask problems when they exist and miss badly when they don't.
- Naive MCMC MLE may not perform well even in very simple problems, but it may be modified. Here we had success in a hard problem using two ideas: (a) stepping toward the MLE in mean value parameter space; (b) a lognormal approximation to the normalizing constant.
- By making MLE more and more automatic, we hope that scientists will be able to focus on modeling, not programming.


Cited

Alon, U. (2007, Nature Reviews Genetics), Network motifs: theory and experimental approaches.
Barndorff-Nielsen, O. (1978), Information and Exponential Families in Statistical Theory, New York: Wiley.
Brown, L. D. (1986), Fundamentals of Statistical Exponential Families, Hayward: Institute of Mathematical Statistics.
Saul, Z. M. and Filkov, V. (2007, Bioinformatics), Exploring biological network structure using exponential random graph models.
Salgado, H. et al. (2001, Nucleic Acids Research), RegulonDB (version 3.2): transcriptional regulation and operon organization in Escherichia coli K-12.
van Duijn, M. A. J., Gile, K., and Handcock, M. S. (2009, Social Networks), A framework for the comparison of maximum pseudo-likelihood and maximum likelihood estimation of exponential family random graph models.

Thank You!