Supporting Statistical Hypothesis Testing Over Graphs. Jennifer Neville, Departments of Computer Science and Statistics, Purdue University (joint work with Tina Eliassi-Rad, Brian Gallagher, Sergey Kirshner, Sebastian Moreno, S.V.N. Vishwanathan, Tao Wang)
Social network prediction and mining Nodes: Facebook users Edges: Articulated friendships Example task: Predict user preferences based on friendships (e.g., political views)
Exploiting network relationships can improve classification Network autocorrelation (correlation of attributes across linked pairs) is ubiquitous: Citation analysis (Taskar et al. 01, Neville & Jensen 03) Fraud detection (Cortes et al. 01, Neville & Jensen 05) Marketing (Domingos et al. 01, Hill et al. 06) Statistical relational models improve accuracy by jointly classifying related instances
Network prediction models. Given nodes with attributes and labels (e.g., <X, Y_i = 1>, <X, Y_j = 0>), the goal is to estimate the joint distribution P(Y | {X}, G)... or the conditional distribution P(Y_i | X_i, X_R, Y_R), where R denotes the instances related to i.
Network autocorrelation interacts with graph structure to impact accuracy of predictive models
Typical evaluation framework for ML methods: partition the dataset into a training set and a test set; learn a model f(X) from the training set; apply the model to the test set; evaluate by comparing the predicted labels f(x) to the true labels y (e.g., score: 77%).
[Figure: a dataset is repeatedly partitioned into training and test sets; models M0-M9 are learned and applied, and their errors averaged.]
Comparison of algorithm performance (Algorithm A vs. Algorithm B). Sampling: how should we partition or sample the available data into training and test sets? k-fold cross-validation is often used. Significance test: is the observed difference in performance significantly greater than what would be expected by random chance? Null hypothesis (H0): the two algorithms' performance scores are drawn from the same distribution. A two-sample t-test is often used.
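The paired version of this test can be computed by hand on fold-wise scores; the accuracies below are made-up illustration values, not results from the talk:

```python
import math

# Hypothetical per-fold accuracies for two algorithms over 10 CV folds
# (illustrative numbers only).
acc_a = [0.78, 0.74, 0.77, 0.80, 0.75, 0.79, 0.76, 0.78, 0.77, 0.74]
acc_b = [0.75, 0.73, 0.74, 0.78, 0.74, 0.76, 0.75, 0.74, 0.76, 0.72]

# Paired t-test on the fold-wise differences: under H0 the two
# algorithms perform identically, so the mean difference is zero.
d = [a - b for a, b in zip(acc_a, acc_b)]
n = len(d)
mean_d = sum(d) / n
var_d = sum((x - mean_d) ** 2 for x in d) / (n - 1)
t_stat = mean_d / math.sqrt(var_d / n)

# Two-sided critical value for t with 9 degrees of freedom at alpha = 0.05.
reject_h0 = abs(t_stat) > 2.262
```

The key assumption baked into this test, and questioned in the rest of the talk, is that the fold-wise differences are independent draws.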
Implicit assumption of structured ML Domain consists of a population of independent graph samples Increase in data corresponds to acquiring more graphs When the graphs are independent, we can use straightforward statistical hypothesis tests
... but statistical learning algorithms are often applied to a single network: email networks, social networks, scientific networks, gene/protein networks, the World Wide Web, organizational networks. In this case, an increase in dataset size corresponds to acquiring a larger sample from the network... this changes the statistical foundation for analysis and learning.
How do we sample a single network when evaluating ML algorithms? Common approach: use repeated random sampling to create multiple sets of labeled/unlabeled nodes (multiple training sets and a test set).
How does simple random sampling affect classifier evaluation?
Evaluation of the paired t-test on network data. Type I error: incorrectly concluding that the algorithms are different when they are not. [Plot: observed vs. expected Type I error rate as a function of the proportion of nodes labeled.] Up to 40% of the time, algorithms will appear to be different when they are not! J. Neville, B. Gallagher, and T. Eliassi-Rad. Evaluating Statistical Tests for Within-Network Classifiers of Relational Data. In Proceedings of the 9th IEEE International Conference on Data Mining, 2009.
Network characteristics that lead to bias. Test sets are dependent when the network is resampled: as the size of the test set increases, the overlap between test sets increases. Network instances are dependent: dependencies among instances lead to correlated errors, and correlated error increases the variance of algorithm performance. [Plot: on real data, error correlation increases with autocorrelation.]
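A small simulation of the variance-inflation effect: when per-instance errors share a common network-level component with pairwise correlation rho, the variance of the average test error is sigma^2 * (1 + (n-1) * rho) / n rather than the i.i.d. value sigma^2 / n. The parameters below are arbitrary illustration values:

```python
import math
import random

random.seed(0)
n, rho, sigma, reps = 100, 0.3, 1.0, 5000

means = []
for _ in range(reps):
    shared = random.gauss(0, 1)  # network-level shock shared by all test instances
    # Each per-instance error mixes the shared shock with private noise so that
    # corr(e_i, e_j) = rho and var(e_i) = sigma^2.
    errors = [sigma * (math.sqrt(rho) * shared + math.sqrt(1 - rho) * random.gauss(0, 1))
              for _ in range(n)]
    means.append(sum(errors) / n)

emp_var = sum(m * m for m in means) / reps       # variance of the average error
naive_var = sigma ** 2 / n                       # what an i.i.d. analysis assumes
true_var = sigma ** 2 * (1 + (n - 1) * rho) / n  # accounts for the correlation
```

With rho = 0.3 the true variance of the average error is roughly 30x the i.i.d. estimate, which is exactly the misestimation a naive significance test inherits.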
We can analytically evaluate the misestimation of variance and use it to adjust the significance test...
Analytical correction reduces Type I error. [Plot: Type I error rate vs. proportion labeled for random resampling (RS) and network cross-validation (NCV), without and with the correction (RS-C, NCV-C).] T. Wang, J. Neville, B. Gallagher, and T. Eliassi-Rad. Correcting Bias in Statistical Tests for Network Classifier Evaluation. In Proceedings of the 21st European Conference on Machine Learning, 2011.
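Wang et al. derive an analytical correction specific to network resampling; as a generic illustration of the underlying idea (inflate the variance estimate before computing the t-statistic), here is the well-known Nadeau-Bengio correction for overlapping random resamples, applied to made-up numbers:

```python
import math

# Hypothetical accuracy differences (algorithm A minus B) over 10 random
# resamples of one network; illustrative numbers only.
diffs = [0.02, 0.01, 0.04, 0.00, 0.03, 0.02, 0.01, 0.03, 0.02, 0.02]
k = len(diffs)
mean_d = sum(diffs) / k
var_d = sum((d - mean_d) ** 2 for d in diffs) / (k - 1)

# Naive paired t-statistic: treats the k resamples as independent.
t_naive = mean_d / math.sqrt(var_d / k)

# Nadeau-Bengio correction: inflate the variance by the test/train size
# ratio to account for overlap between resampled train/test sets.
n_test, n_train = 300, 700
t_corrected = mean_d / math.sqrt(var_d * (1.0 / k + n_test / n_train))
```

The corrected statistic is markedly smaller, so fewer spurious differences cross the significance threshold.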
We have seen how autocorrelation and structure affect performance... but what about variation in network structure itself?
[Figure: subnetworks are repeatedly sampled from a single network dataset to form training and test sets; models M0-M9 are learned and applied, and their errors averaged.]
How to generate network samples for statistical hypothesis testing?
Sample subnetworks with similar structure.
Without replacement: N. Ahmed, J. Neville, and R. Kompella. Network sampling via edge-based node selection with graph induction. Technical Report 11-016, Department of Computer Science, Purdue University, 2011.
With replacement: H. Eldardiry and J. Neville. A Resampling Technique for Relational Data Graphs. In Proceedings of the 2nd SNA Workshop, 14th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2008.
Randomize the observed network: T. LaFond and J. Neville. Randomization tests for distinguishing social influence and homophily effects. In Proceedings of the International World Wide Web Conference (WWW), 2010.
Generate sample networks from a probabilistic model: S. Moreno, S. Kirshner, J. Neville, and S.V.N. Vishwanathan. Tied Kronecker Product Graph Models to Capture Variance in Network Populations. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.
To support hypothesis testing, we need to accurately characterize the variability in graph populations... What is the natural variability of graph structure in real-world domains?
Purdue Facebook populations Public wall graph Temporal samples, with 1024 nodes & 60 days of posts
AddHealth populations Middle and high school social networks 25 networks with 800-2000 nodes
Can current graph models capture the variance we observe in real-world domains?
Kronecker product graph model (KPGM) (Leskovec & Faloutsos 07). Starting from an initiator matrix Θ of n x n Bernoulli parameters, a matrix of size n^k is constructed by repeated Kronecker multiplication; a network is then generated by sampling from the Bernoulli random variable in each cell. Assuming Bernoulli rvs, the MLE for Θ can be learned from an observed network.
e.g., Θ = [0.9 0.5; 0.6 0.2]
Θ ⊗ Θ = [0.9Θ 0.5Θ; 0.6Θ 0.2Θ] =
[0.81 0.45 0.45 0.25]
[0.54 0.18 0.30 0.10]
[0.54 0.30 0.18 0.10]
[0.36 0.12 0.12 0.04]
A sampled adjacency matrix, e.g.:
[1 0 0 1]
[1 0 0 0]
[0 0 1 0]
[1 0 0 0]
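The Kronecker construction above can be sketched in a few lines of NumPy (the initiator values are taken from the slide; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# 2x2 initiator matrix of Bernoulli parameters, taken from the slide.
theta = np.array([[0.9, 0.5],
                  [0.6, 0.2]])

# k Kronecker multiplications give the edge-probability matrix for a
# 2^k-node graph; cell (u, v) is the product of one theta entry per level.
k = 3
P = theta.copy()
for _ in range(k - 1):
    P = np.kron(P, theta)

# Generate a network by an independent Bernoulli draw in each cell.
A = (rng.random(P.shape) < P).astype(int)
```

Because every cell is sampled independently, the total edge count concentrates tightly around its mean, which foreshadows the low-variance issue discussed later.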
Exponential random graph model (ERGM) (Frank & Strauss 86). Represents the joint probability over network structures with a log-linear model: P(G | θ) = exp(θᵀ f(G)) / Z(θ), where f(G) is a set of graph features and θ is a vector of weights for those features. Features count the occurrences of local subgraphs such as k-triangles or k-stars: triangle, 2-triangle, 2-star, 3-star.
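A minimal illustration of the log-linear form on a toy 4-node graph; the edge list and feature weights are made up, and the partition function Z(θ) is omitted since it sums over all graphs:

```python
import math
from itertools import combinations

# Tiny undirected graph on 4 nodes (illustrative edge list).
nodes = [0, 1, 2, 3]
edges = {(0, 1), (1, 2), (0, 2), (2, 3)}

def has_edge(u, v):
    return (u, v) in edges or (v, u) in edges

# ERGM-style local subgraph counts: edges, 2-stars, triangles.
n_edges = len(edges)
n_triangles = sum(1 for a, b, c in combinations(nodes, 3)
                  if has_edge(a, b) and has_edge(b, c) and has_edge(a, c))
n_2stars = sum(1 for center in nodes for _ in combinations(
    [v for v in nodes if v != center and has_edge(center, v)], 2))

# Unnormalized log-linear weight exp(theta . f(G)); hypothetical weights.
theta = {"edge": -1.0, "2star": 0.2, "triangle": 0.8}
log_weight = (theta["edge"] * n_edges + theta["2star"] * n_2stars
              + theta["triangle"] * n_triangles)
weight = math.exp(log_weight)
```

Positive weights on triangles make clustered graphs more probable, which is how the model encodes transitivity.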
Variation of generated graphs: Facebook. [Plots: degree, hop plot, and clustering distributions for the original population vs. KPGM and ERGM samples.]
Variation of generated graphs: AddHealth. Both models produce surprisingly little variance in their generated graphs. [Plots: degree, hop plot, clustering.] S. Moreno and J. Neville. An Investigation of the Distributional Characteristics of Generative Graph Models. In Proceedings of the 1st Workshop on Information in Networks, 2009.
Let's reconsider the KPGM generation process... as a hierarchy of Bernoulli trials.
KPGM generation process: 1st-, 2nd-, and 3rd-level Kronecker products (colors = parameter values, cells = independent parameters). Graph generation = the product of trial results from each level.
Low variance is due to large number of independent Bernoulli trials... but we know that dependencies can increase variance, so what if trials are dependent instead?
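A quick simulation of that intuition: with the same marginal parameter p, n independent trials give variance n*p*(1-p) in the total count, while a single fully tied trial shared by all n cells gives n^2*p*(1-p). The parameters are arbitrary illustration values:

```python
import random

random.seed(1)
n, p, reps = 64, 0.3, 20000

def total_cells_on(tied):
    """Count 'edge present' outcomes over n cells sharing the same parameter p."""
    if tied:
        # A single shared Bernoulli trial decides all n cells at once.
        return n if random.random() < p else 0
    # n independent Bernoulli trials, one per cell.
    return sum(random.random() < p for _ in range(n))

def sample_variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

var_indep = sample_variance([total_cells_on(False) for _ in range(reps)])
var_tied = sample_variance([total_cells_on(True) for _ in range(reps)])
# Independent trials: Var ~ n*p*(1-p) = 13.4; fully tied: Var ~ n^2*p*(1-p) = 860.
```

Tying trials multiplies the variance of the count by a factor of n, which is exactly the lever the tied KPGM uses to match real-world variability.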
Tied KPGM generation: 1st-, 2nd-, and 3rd-level Kronecker products (colors = parameter values, cells = independent parameters). Graph generation = the product of trial results from each level, with trials shared (tied) across the cells of each block.
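A sketch of the tied generation process as we read it from the slides: one Bernoulli trial is drawn per block per level, and all descendant cells inherit its outcome (pure Python, illustrative only; initiator values from the KPGM slide):

```python
import random

random.seed(2)

# 2x2 initiator of Bernoulli parameters (same values as the KPGM slide).
theta = [[0.9, 0.5],
         [0.6, 0.2]]
b, k = 2, 3  # block size and number of levels; the graph has b**k nodes

def sample_tied(level):
    """Sample a tied-KPGM adjacency block of size b**level.

    One Bernoulli trial per block per level: if the trial fails, the
    entire block stays empty; if it succeeds, we recurse inside it.
    """
    size = b ** level
    A = [[0] * size for _ in range(size)]
    off = b ** (level - 1)
    for i in range(b):
        for j in range(b):
            if random.random() < theta[i][j]:  # the shared trial for this block
                if level == 1:
                    A[i][j] = 1
                else:
                    sub = sample_tied(level - 1)
                    for r in range(off):
                        for c in range(off):
                            A[i * off + r][j * off + c] = sub[r][c]
    return A

adj = sample_tied(k)
```

Compared with the independent-cell KPGM, a single failed top-level trial now empties a quarter of the matrix, producing far more variability across generated graphs.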
KPGM vs. tied KPGM Tied KPGMs exhibit more clustering, probably too much...
Mix KPGMs with tKPGMs to incorporate dependencies for a portion of the hierarchy
Mixed KPGMs. S. Moreno, S. Kirshner, J. Neville, and S.V.N. Vishwanathan. Tied Kronecker Product Graph Models to Capture Variance in Network Populations. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, 2010.
Comparison of three models
How does parameter tying affect variance?
Variation with different mixing proportions: 300 graphs generated with parameters Θ = [0.99 0.20; 0.20 0.77]. [Plots: variation in graph statistics as a function of the number of levels that are not tied.]
Learning mKPGM models. Extending the KPGM parameter estimation algorithm for mKPGMs: if the mixing level is known, then we can alternate between sampling a permutation and estimating the parameters of the KPGM and tKPGM. But MLE estimation only works well when starting from very close to the true permutation, which is unknown in practice. Heuristic approach for mKPGM parameter estimation: the fractal structure of the mKPGM implies that some subgraphs are generated directly from an (equivalent) KPGM at a local scale. We use snowball sampling to identify subgraphs that were likely to have been generated in the same block of the mKPGM model, estimate parameters locally within these sampled subgraphs, and combine the set of estimates in an ensemble approach to estimate the overall mKPGM parameters.
Experimental results: synthetic data. The mKPGM estimation method is able to recover the correct parameters in synthetic-data experiments.
Experimental results: Facebook data. On real data, the estimated mKPGMs exhibit more variance and more accurately capture clustering.
Towards a statistical hypothesis testing framework for anomaly detection
Anomaly detection with mKPGMs. Approach: (1) learn an mKPGM model from a representative network; (2) use the learned mKPGM to generate an empirical sampling distribution of the likelihood for networks drawn from the same population; (3) compute the likelihood of each newly observed network and flag as anomalies those that are very unlikely under that sampling distribution.
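The three-step recipe above can be sketched as follows, with a simple independent-Bernoulli edge model standing in for a learned mKPGM (all probabilities and sizes are made up for illustration):

```python
import math
import random

random.seed(3)

n = 6
# Hypothetical edge probabilities standing in for a learned model's output:
# denser within the two blocks {0,1,2} and {3,4,5}, sparser across them.
P = [[0.0] * n for _ in range(n)]
for i in range(n):
    for j in range(n):
        if i != j:
            P[i][j] = 0.8 if (i < 3) == (j < 3) else 0.6

def sample_graph():
    """Draw an undirected graph by an independent Bernoulli trial per pair."""
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            A[i][j] = A[j][i] = int(random.random() < P[i][j])
    return A

def log_likelihood(A):
    ll = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            ll += math.log(P[i][j] if A[i][j] else 1.0 - P[i][j])
    return ll

# Empirical sampling distribution of the likelihood under the model.
ref = sorted(log_likelihood(sample_graph()) for _ in range(500))
threshold = ref[24]  # ~5th percentile of the 500 reference likelihoods

# An empty graph is wildly improbable under this dense model -> flagged.
empty = [[0] * n for _ in range(n)]
is_anomaly = log_likelihood(empty) < threshold
```

The same loop works for any generative model that supports sampling and likelihood evaluation; the mKPGM's role is to make the reference distribution realistically wide.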
Initial results: Facebook. The likelihood-based approach is able to distinguish random networks with the same density.
Open questions. How to conduct accurate hypothesis tests? Dependencies and heterogeneous structure impact variance estimates, making it difficult to quantify effective sample size and assess the significance of differences; thus we need an empirical sampling distribution of network structure... How to accurately model graph populations? There are many generative graph models, but evaluation focuses mainly on matching the properties of a single observed network... What is the population? A single large, evolving graph process, or a collection of graphs with similar size/structure? How do we evaluate the representativeness of a sample network?
Conclusion. Relational dependencies can significantly improve predictions through the use of collective inference models... but current methods make assumptions about data and model characteristics that are often not appropriate: link information is heterogeneous, not uniform/stationary; label and attribute information is sparse, not fully labeled; the data comprise a single network, not a population of networks. To best exploit relational information for prediction, we need to consider graph/data structure carefully and understand its impact on predictive modeling of attributes.
Questions? neville@cs.purdue.edu www.cs.purdue.edu/~neville