Supporting Statistical Hypothesis Testing Over Graphs

Supporting Statistical Hypothesis Testing Over Graphs. Jennifer Neville, Departments of Computer Science and Statistics, Purdue University (joint work with Tina Eliassi-Rad, Brian Gallagher, Sergey Kirshner, Sebastian Moreno, S.V.N. Vishwanathan, Tao Wang)

Social network prediction and mining. Nodes: Facebook users. Edges: articulated friendships. Example task: predict user preferences based on friendships (e.g., political views).

Exploiting network relationships can improve classification. Network autocorrelation (correlation of attributes across linked pairs) is ubiquitous: citation analysis (Taskar et al. 01; Neville & Jensen 03), fraud detection (Cortes et al. 01; Neville & Jensen 05), marketing (Domingos et al. 01; Hill et al. 06). Statistical relational models improve accuracy by jointly classifying related instances.
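
A minimal sketch of measuring network autocorrelation as the slide defines it, the correlation of an attribute across linked pairs (the edge list, label dictionary, and `edge_autocorrelation` helper are illustrative, not from the talk):

```python
# Hypothetical helper: Pearson correlation of a node attribute across edges.
import numpy as np

def edge_autocorrelation(edges, labels):
    """Correlation of labels[u] with labels[v] over all edges (u, v)."""
    left = np.array([labels[u] for u, v in edges], dtype=float)
    right = np.array([labels[v] for u, v in edges], dtype=float)
    return np.corrcoef(left, right)[0, 1]

# Toy example: two tightly linked clusters with homogeneous labels.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
labels = {0: 1, 1: 1, 2: 1, 3: 0, 4: 0, 5: 0}
print(edge_autocorrelation(edges, labels))  # ~0.75: strong autocorrelation
```

For undirected graphs, a common convention is to include each edge in both orientations so the measure is symmetric in u and v.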

Network prediction models. Example instances: <X, Y_i = 1>, <X, Y_j = 0>. Goal: estimate the joint distribution P(Y | {X}^n, G)... or the conditional distribution P(Y_i | X_i, X_R, Y_R).

Network autocorrelation interacts with graph structure to impact accuracy of predictive models

Typical evaluation framework for ML methods: partition the dataset into a training set and a test set; learn a model F(X) from the training set; apply the model to the test set; evaluate by comparing predicted labels f(x) to true labels y. [Figure: attribute tables (Y, X1, X2) for training and test sets, with an example score of 77%.]

[Figure: the dataset is repeatedly partitioned into training and test sets; models M0 through M9 are learned and applied, and the average error is computed.]

Comparison of algorithm performance. Sampling: how to partition or sample available data into training and test sets? k-fold cross-validation is often used. Significance test: is the observed difference in performance between algorithm A and algorithm B significantly greater than what would be expected by random chance? Null hypothesis (H0): algorithm performance rates are drawn from the same distribution. A two-sample t-test is often used.
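
As a concrete (hypothetical) instance of the procedure this slide describes, here is k-fold cross-validation feeding a paired t-test; the per-fold accuracies are simulated placeholders:

```python
# Paired t-test over k-fold CV scores for two competing algorithms.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Placeholder per-fold accuracies for algorithms A and B over k = 10 folds.
acc_a = rng.normal(0.80, 0.02, size=10)
acc_b = rng.normal(0.78, 0.02, size=10)

# H0: the per-fold performance differences have mean zero.
t_stat, p_value = stats.ttest_rel(acc_a, acc_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# Reject H0 at level 0.05 if p_value < 0.05 -- valid only when fold-level
# estimates are (approximately) independent, which the following slides
# show can fail badly for within-network evaluation.
```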

Implicit assumption of structured ML: the domain consists of a population of independent graph samples, and an increase in data corresponds to acquiring more graphs. When the graphs are independent, we can use straightforward statistical hypothesis tests.

... but statistical learning algorithms are often applied to a single network (e.g., email networks, social networks, scientific networks, gene/protein networks, the world wide web, organizational networks). In this case, an increase in dataset size corresponds to acquiring a larger sample from the network... this changes the statistical foundation for analysis and learning.

How do we sample a single network when evaluating ML algorithms? Common approach: use repeated random sampling to create multiple sets of labeled/unlabeled nodes, each split yielding a training set and a test set.
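
A minimal sketch of this common approach, repeatedly splitting a single network's nodes into labeled (training) and unlabeled (test) sets; the function name and proportions are assumptions for illustration:

```python
# Repeated random node sampling for within-network evaluation.
import numpy as np

def random_node_splits(n_nodes, prop_labeled, n_trials, seed=0):
    rng = np.random.default_rng(seed)
    splits = []
    for _ in range(n_trials):
        perm = rng.permutation(n_nodes)
        cut = int(prop_labeled * n_nodes)
        splits.append((perm[:cut], perm[cut:]))  # (labeled/train, test)
    return splits

# Ten random 30%-labeled splits of a 1000-node network. Note that test
# sets from different trials overlap heavily -- exactly the dependence
# the next slides show inflates Type I error.
splits = random_node_splits(n_nodes=1000, prop_labeled=0.3, n_trials=10)
```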

How does simple random sampling affect classifier evaluation?

Evaluation of paired t-test on network data. Type I error: incorrectly conclude that algorithms are different when they are not. [Figure: observed vs. expected Type I error rate as the proportion of labeled nodes varies.] Up to 40% of the time, algorithms will appear to be different when they are not! (J. Neville, B. Gallagher, and T. Eliassi-Rad. Evaluating Statistical Tests for Within-Network Classifiers of Relational Data. In Proceedings of the 9th IEEE International Conference on Data Mining, 2009.)

Network characteristics that lead to bias. Test sets are dependent when the network is resampled: as the size of the test set increases, the overlap between test sets increases. Network instances are dependent: dependencies among instances lead to correlated errors, and correlated error increases the variance of algorithm performance. [Figure: error correlation vs. autocorrelation on real data.]

We can analytically evaluate the misestimation of variance and use it to adjust the significance test...

Analytical correction reduces Type I error. [Figure: Type I error rate vs. proportion labeled for random resampling (RS) and network cross-validation (NCV), with and without the correction (RS-C, NCV-C).] (T. Wang, J. Neville, B. Gallagher, and T. Eliassi-Rad. Correcting Bias in Statistical Tests for Network Classifier Evaluation. In Proceedings of the 21st European Conference on Machine Learning, 2011.)

But autocorrelation and structure interact to impact performance... what about variation in network structure?

[Figure: as before, but the dataset is now a single network; subnetwork samples yield training and test sets, models M0 through M9, and an average error.]

How to generate network samples for statistical hypothesis testing?

Sample subnetworks with similar structure. Without replacement: N. Ahmed, J. Neville, and R. Kompella. Network sampling via edge-based node selection with graph induction. Technical Report 11-016, CS Dept, Purdue University, 2011. With replacement: H. Eldardiry and J. Neville. A Resampling Technique for Relational Data Graphs. In Proceedings of the 2nd SNA Workshop, 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2008.

Randomize the observed network: T. LaFond and J. Neville. Randomization tests for distinguishing social influence and homophily effects. In Proceedings of the International World Wide Web Conference (WWW), 2010.

Generate sample networks from a probabilistic model: S. Moreno, S. Kirshner, J. Neville, and S.V.N. Vishwanathan. Tied Kronecker Product Graph Models to Capture Variance in Network Populations. In Proceedings of the 48th Annual Allerton Conference on Communications, Control and Computing, 2010.

To support hypothesis testing, we need to accurately characterize the variability in graph populations... What is the natural variability of graph structure in real-world domains?

Purdue Facebook populations: public wall graph; temporal samples with 1024 nodes and 60 days of posts.

AddHealth populations: middle and high school social networks; 25 networks with 800-2000 nodes.

Can current graph models capture the variance we observe in real-world domains?

Kronecker product graph model (KPGM) (Leskovec & Faloutsos 07). Starting from an n x n initiator matrix of Bernoulli parameters, a matrix of size n^k x n^k is constructed using Kronecker multiplication; a network is then generated by sampling from the Bernoulli random variable in each cell. Assuming independent Bernoulli rvs, the MLE for Θ can be learned from an observed network. For example:

Θ = [0.9 0.5]
    [0.6 0.2]

Θ ⊗ Θ = [0.9Θ 0.5Θ] = [0.81 0.45 0.45 0.25]
        [0.6Θ 0.2Θ]   [0.54 0.18 0.30 0.10]
                      [0.54 0.30 0.18 0.10]
                      [0.36 0.12 0.12 0.04]

and one sampled adjacency matrix:

[1 0 0 1]
[1 0 0 0]
[0 0 1 0]
[1 0 0 0]
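
A brute-force sketch of KPGM generation as this slide describes it: take the k-th Kronecker power of the initiator and flip an independent Bernoulli coin per cell. This materializes the full n^k x n^k matrix, so it only suits small k; practical implementations sample edges directly:

```python
# Naive KPGM sampler: Kronecker power of the initiator, then cell-wise coins.
import numpy as np

def sample_kpgm(theta, k, seed=0):
    rng = np.random.default_rng(seed)
    probs = theta.copy()
    for _ in range(k - 1):
        probs = np.kron(probs, theta)          # n^k x n^k edge probabilities
    return (rng.random(probs.shape) < probs).astype(int)

theta = np.array([[0.9, 0.5],
                  [0.6, 0.2]])                 # the slide's example initiator
adj = sample_kpgm(theta, k=2)                  # one sampled 4 x 4 adjacency
```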

Exponential random graph model (ERGM) (Frank & Strauss 86). Represents the joint probability over network structures with a log-linear model:

P(G = g) = exp(θᵀ f(g)) / Z(θ)

where f(g) is a vector of graph features, θ is a vector of weights for those features, and Z(θ) is the normalizing constant. Features count the occurrences of local subgraphs such as k-triangles or k-stars (e.g., triangle, 2-triangle, 2-star, 3-star).
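
A small numeric sketch of this form for two classic features, edge count and triangle count; the weights are arbitrary, and only the unnormalized log-probability is computed since Z(θ) sums over all graphs and is intractable in general:

```python
# Unnormalized ERGM log-probability: theta^T f(g), with f = (edges, triangles).
import numpy as np

def ergm_features(adj):
    edges = adj.sum() / 2                      # undirected edge count
    triangles = np.trace(adj @ adj @ adj) / 6  # each triangle counted 6 times
    return np.array([edges, triangles])

def unnormalized_log_prob(adj, weights):
    return weights @ ergm_features(adj)        # equals log P(g) + log Z(theta)

adj = np.array([[0, 1, 1],
                [1, 0, 1],
                [1, 1, 0]])                    # a single triangle
print(unnormalized_log_prob(adj, np.array([-1.0, 0.5])))  # 3*(-1) + 1*0.5 = -2.5
```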

Variation of generated graphs: Facebook. [Figure: degree, hop plot, and clustering distributions for the original population vs. graphs generated by KPGM and ERGM.]

Variation of generated graphs: AddHealth. Both models produce surprisingly little variance in their generated graphs. [Figure: degree, hop plot, and clustering distributions.] (S. Moreno and J. Neville. An Investigation of the Distributional Characteristics of Generative Graph Models. In Proceedings of the 1st Workshop on Information in Networks, 2009.)

Let's reconsider the KPGM generation process... as a hierarchy of Bernoulli trials.

KPGM generation process, level by level: 1st level Kronecker, 2nd level Kronecker, 3rd level Kronecker (colors = parameter values, cells = independent parameters). Graph generation = product of trial results from each level.

Low variance is due to the large number of independent Bernoulli trials... but we know that dependencies can increase variance, so what if trials are dependent instead?
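
A quick numeric illustration of this point (the block size, edge probability, and replicate count are arbitrary): summing many independent Bernoulli trials concentrates tightly, while tying trials together by sharing one coin across a block inflates the variance dramatically:

```python
# Variance of a sum of Bernoulli trials: independent vs. fully tied.
import numpy as np

rng = np.random.default_rng(0)
m, p, reps = 1024, 0.3, 100_000

independent = rng.random((reps, m)) < p                  # m coins per replicate
tied = np.repeat(rng.random((reps, 1)) < p, m, axis=1)   # one shared coin

print(independent.sum(axis=1).var())  # ~ m * p * (1 - p)    ~ 215
print(tied.sum(axis=1).var())         # ~ m^2 * p * (1 - p)  ~ 220,000
```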

Tied KPGM generation, level by level: 1st level Kronecker, 2nd level Kronecker, 3rd level Kronecker (colors = parameter values, cells = independent parameters). Graph generation = product of trial results from each level.
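
A hedged sketch of tied generation consistent with these slides: every cell keeps the same marginal probability as in a KPGM (the product of its k level parameters), but all cells within a block share the outcome of the block's higher-level trial. The function name and array representation are assumptions:

```python
# Tied-KPGM-style sampler: a hierarchy of Bernoulli trials with shared parents.
import numpy as np

def sample_tkpgm(theta, k, seed=0):
    rng = np.random.default_rng(seed)
    n = theta.shape[0]
    real = (rng.random((n, n)) < theta).astype(int)      # 1st-level trials
    for _ in range(k - 1):
        m = real.shape[0]
        # Each surviving cell spawns an n x n block of fresh trials; blocks
        # whose parent trial failed stay all-zero, tying their cells together.
        fresh = rng.random((m * n, m * n)) < np.tile(theta, (m, m))
        real = np.kron(real, np.ones((n, n), dtype=int)) * fresh
    return real

theta = np.array([[0.9, 0.5],
                  [0.6, 0.2]])
adj = sample_tkpgm(theta, k=3)   # 8 x 8 adjacency with tied block structure
```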

KPGM vs. tied KPGM. Tied KPGMs exhibit more clustering, probably too much...

Mix KPGMs with tied KPGMs (tKPGMs) to incorporate dependencies for a portion of the hierarchy.

Mixed KPGMs (mKPGMs). (S. Moreno, S. Kirshner, J. Neville, and S.V.N. Vishwanathan. Tied Kronecker Product Graph Models to Capture Variance in Network Populations. In Proceedings of the 48th Annual Allerton Conference on Communications, Control and Computing, 2010.)

Comparison of three models

How does parameter tying affect variance?

Variation with different mixing proportions: 300 graphs generated with parameters

Θ = [0.99 0.20]
    [0.20 0.77]

[Figure: variation of graph statistics as a function of the number of levels that are not tied.]

Learning mKPGM models. Extending the KPGM parameter estimation algorithm for mKPGMs: if the mixing level is known, then we can alternate between sampling a permutation and estimating the parameters of the KPGM and tKPGM. But MLE estimation only works well when starting from very close to the true permutation, which is unknown in practice. Heuristic approach for mKPGM parameter estimation: the fractal structure of the mKPGM implies that some subgraphs are generated directly from an (equivalent) KPGM at a local scale. We use snowball sampling to identify subgraphs that were likely to have been generated in the same block of the mKPGM model, then estimate parameters locally within these sampled subgraphs, and combine the set of estimated parameters in an ensemble approach to estimate the overall mKPGM parameters.
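
A minimal sketch of just the snowball-sampling step mentioned above, expanding outward from a seed node for a fixed number of waves (the adjacency-set representation and wave count are assumptions; the local estimation and ensemble-combination steps are omitted):

```python
# Snowball sampling: collect all nodes within `waves` hops of a seed node.
from collections import defaultdict

def snowball_sample(adj, seed_node, waves=2):
    frontier, visited = {seed_node}, {seed_node}
    for _ in range(waves):
        frontier = {v for u in frontier for v in adj[u]} - visited
        visited |= frontier
    return visited

adj = defaultdict(set)
for u, v in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 5)]:
    adj[u].add(v)
    adj[v].add(u)
print(snowball_sample(adj, seed_node=0, waves=2))  # {0, 1, 2, 5}
```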

Experimental results: synthetic data. The mKPGM estimation method is able to recover the correct parameters in synthetic data experiments.

Experimental results: Facebook data. On real data, the estimated mKPGMs exhibit more variance and more accurately capture clustering.

Towards a statistical hypothesis testing framework for anomaly detection

Anomaly detection with mKPGMs. Approach: (1) learn an mKPGM model from a representative network; (2) use the learned mKPGM to generate an empirical sampling distribution of the likelihood for networks drawn from the same population; (3) compute the likelihood of each newly observed network and flag as anomalies those that are very unlikely under the sampling distribution.
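
A hedged sketch of this likelihood-based test, simplified to a plain KPGM likelihood under a fixed node ordering (a real mKPGM version must also handle the unknown node permutation and the tied trials); the threshold level and sample count are illustrative:

```python
# Empirical likelihood threshold for flagging anomalous networks.
import numpy as np

def kron_power(theta, k):
    probs = theta.copy()
    for _ in range(k - 1):
        probs = np.kron(probs, theta)
    return probs

def log_likelihood(adj, probs):
    p = np.clip(probs, 1e-12, 1 - 1e-12)       # guard against log(0)
    return (adj * np.log(p) + (1 - adj) * np.log(1 - p)).sum()

def anomaly_threshold(probs, n_samples=500, alpha=0.05, seed=0):
    """alpha-quantile of log-likelihood over model-generated networks."""
    rng = np.random.default_rng(seed)
    lls = [log_likelihood((rng.random(probs.shape) < probs).astype(int), probs)
           for _ in range(n_samples)]
    return np.quantile(lls, alpha)

probs = kron_power(np.array([[0.9, 0.5], [0.6, 0.2]]), k=5)   # 32 x 32 model
thresh = anomaly_threshold(probs)
# Flag an observed 32 x 32 network g_new as anomalous if
# log_likelihood(g_new, probs) < thresh.
```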

Initial results: Facebook. The likelihood-based approach is able to distinguish random networks with the same density.

Open questions. How to conduct accurate hypothesis tests? Dependencies and heterogeneous structure impact variance estimates and make it difficult to quantify effective sample size and assess the significance of differences; thus, we need empirical sampling distributions of network structure. How to accurately model graph populations? There are many generative graph models, but evaluation focuses mainly on matching the properties of a single observed network. What is the population: a single large, evolving graph process, or a collection of graphs with similar size/structure? How to evaluate the representativeness of a sample network?

Conclusion. Relational dependencies can significantly improve predictions through the use of collective inference models... but current methods make assumptions about data and model characteristics that are often not appropriate: link information is heterogeneous, not uniform/stationary; label and attribute information is sparse, not fully labeled; data comprises a single network, not a population of networks. In order to best exploit relational information for prediction, we need to consider graph/data structure carefully and understand its impact on predictive modeling of attributes.

Questions? neville@cs.purdue.edu www.cs.purdue.edu/~neville