Perfect Power Law Graphs: Generation, Sampling, Construction, and Fitting

Similar documents
Spectral Methods for Subgraph Detection

Subgraph Detection Using Eigenvector L1 Norms

Analytic Theory of Power Law Graphs

MATH 61-02: PRACTICE PROBLEMS FOR FINAL EXAM

Mini course on Complex Networks

ON THE NP-COMPLETENESS OF SOME GRAPH CLUSTER MEASURES

Worksheet Topic 1 Order of operations, combining like terms 2 Solving linear equations 3 Finding slope between two points 4 Solving linear equations

7.5 Bipartite Matching

Graphic sequences, adjacency matrix

Advanced Combinatorial Optimization September 22, Lecture 4

Graph fundamentals. Matrices associated with a graph

Graph Detection and Estimation Theory

TOTAL JITTER MEASUREMENT THROUGH THE EXTRAPOLATION OF JITTER HISTOGRAMS

AM205: Assignment 2. i=1

Topics Covered in Math 115

Simple sparse matrices we have seen so far include diagonal matrices and tridiagonal matrices, but these are not the only ones.

Algebra 1 Mathematics: to Hoover City Schools

Balanced Allocation Through Random Walk

Algebra I. 60 Higher Mathematics Courses Algebra I

ORIE 4741: Learning with Big Messy Data. Spectral Graph Theory

Local Maxima and Improved Exact Algorithm for MAX-2-SAT

Alternative Parameterizations of Markov Networks. Sargur Srihari

4. How to prove a problem is NPC

Data Mining and Analysis: Fundamental Concepts and Algorithms

Algebra I Number and Quantity The Real Number System (N-RN)

Temporal and Multi-Source Fusion for Detection of Innovation in Collaboration Networks

Use a graphing utility to approximate the real solutions, if any, of the equation rounded to two decimal places. 4) x3-6x + 3 = 0 (-5,5) 4)

Algebra 1 Standards Curriculum Map Bourbon County Schools. Days Unit/Topic Standards Activities Learning Targets ( I Can Statements) 1-19 Unit 1

Increasing the Span of Stars

COMPSCI 611 Advanced Algorithms Second Midterm Exam Fall 2017

18.312: Algebraic Combinatorics Lionel Levine. Lecture 19

ACO Comprehensive Exam March 17 and 18, Computability, Complexity and Algorithms

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)}

Independent Transversals in r-partite Graphs

Algebra I. 60 Higher Mathematics Courses Algebra I

Mock Final Exam Name. Solve and check the linear equation. 1) (-8x + 8) + 1 = -7(x + 3) A) {- 30} B) {- 6} C) {30} D) {- 28}

School District of Marshfield Course Syllabus

Permutations and Combinations

ALGEBRA I CCR MATH STANDARDS

Communities, Spectral Clustering, and Random Walks

Project in Computational Game Theory: Communities in Social Networks

Essential Learning Outcomes for Algebra 2

CS 580: Algorithm Design and Analysis. Jeremiah Blocki Purdue University Spring 2018

Genetic Algorithm approach to Solve Shortest Path and Travelling Salesman Problem

N-Q2. Define appropriate quantities for the purpose of descriptive modeling.

Inapproximability Ratios for Crossing Number

Aditya Bhaskara CS 5968/6968, Lecture 1: Introduction and Review 12 January 2016

Adjustments for the Display of Quantized Ion Channel Dwell Times in Histograms with Logarithmic Bins

Modelling self-organizing networks

ECLT 5810 Data Preprocessing. Prof. Wai Lam

Algebra 2A Unit 1 Week 1 Day Activity Unit 1 Week 2 Day Activity Unit 1 Week 3 Day Activity Unit 2 Week 1 Day Activity

Lecture 11 October 7, 2013

Graph Theorizing Peg Solitaire. D. Paul Hoilman East Tennessee State University

1 Functions and Graphs

Algebra II Vocabulary Alphabetical Listing. Absolute Maximum: The highest point over the entire domain of a function or relation.

MHCA Math Summer Packet 2015

On the Maximal Error of Spectral Approximation of Graph Bisection

Further Results on Square Sum Graph

THE EXTREMAL FUNCTIONS FOR TRIANGLE-FREE GRAPHS WITH EXCLUDED MINORS 1

Algebra , Martin-Gay

ANCOVA. Lecture 9 Andrew Ainsworth

12. LOCAL SEARCH. gradient descent Metropolis algorithm Hopfield neural networks maximum cut Nash equilibria

A = A U. U [n] P(A U ). n 1. 2 k(n k). k. k=1

Learning Module 1 - Basic Algebra Review (Appendix A)

Algebra 1A is a prerequisite course for Algebra 1B. Before beginning this course, you should be able to do the following:

MICHIGAN STANDARDS MAP for a Basic Grade-Level Program. Grade Eight Mathematics (Algebra I)

Spiral Review Probability, Enter Your Grade Online Quiz - Probability Pascal's Triangle, Enter Your Grade

Exact Algorithms for Dominating Induced Matching Based on Graph Partition

Asymptotic enumeration of sparse uniform linear hypergraphs with given degrees

Chapter 2. Limits and Continuity 2.6 Limits Involving Infinity; Asymptotes of Graphs

Notes 6 : First and second moment methods

Common Core State Standards for Mathematics Integrated Pathway: Mathematics III

To a large extent abandoned once gradient-based methods. One applicable class of methods is Generating set search

Conventional Matrix Operations

Algebra I, Common Core Correlation Document

North Carolina MATH I (Traditional) Pacing Guide

Common Core State Standards for Mathematics - High School

Trimester 1 Expectations CCSS covered Key Vocabulary Vertical Alignment

Algebra II. A2.1.1 Recognize and graph various types of functions, including polynomial, rational, and algebraic functions.

Randomized algorithm

Kinetics Teacher Answer Key Section I

6 Euler Circuits and Hamiltonian Cycles

A Graph Polynomial Approach to Primitivity

ACCRS/QUALITY CORE CORRELATION DOCUMENT: ALGEBRA I

RTM: Laws and a Recursive Generator for Weighted Time-Evolving Graphs

Ring Sums, Bridges and Fundamental Sets

Beal City High School Algebra 2A Curriculum and Alignment

Part V. Matchings. Matching. 19 Augmenting Paths for Matchings. 18 Bipartite Matching via Flows

Model Traditional Pathway: Model Algebra I Content Standards [AI]

k-protected VERTICES IN BINARY SEARCH TREES

Mathematics. Number and Quantity The Real Number System

Introductory Mathematics

Markov Random Fields

Alg II Syllabus (First Semester)

Mississippi ALGEBRA I (Traditional) Pacing Guide

Midterm 1. Your Exam Room: Name of Person Sitting on Your Left: Name of Person Sitting on Your Right: Name of Person Sitting in Front of You:

Spectral Generative Models for Graphs

Mathematics Standards for High School Algebra I

Random walks and anisotropic interpolation on graphs. Filip Malmberg

Computer Intensive Methods in Mathematical Statistics

Transcription:

Perfect Power Law Graphs: Generation, Sampling, Construction, and Fitting Jeremy Kepner SIAM Annual Meeting, Minneapolis, July 9, 2012 This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.

Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 2

Goals Develop a background model for graphs based on perfect power law Examine effects of sampling such a power law Develop techniques for comparing real data with a power law model D4M- 3

Detection Theory DETECTION OF SIGNAL IN NOISE THRESHOLD DETECTION OF SUBGRAPHS IN GRAPHS Example subgraph of interest: Fully connected (complete) SIGNAL NOISE SIGNAL H 0 H 1 NOISE N-D SPACE ASSUMPTIONS Background (noise) statistics Foreground (signal) statistics Foreground/background separation Model reality Example background model: Powerlaw graph Can we construct a background model based on power law degree distribution? D4M- 4

Perfect Power Law Matrix Definition N out Adjacency/Incidence Matrix N in A M = ΣA edges n(d out ) number of vertices 10 5 10 4 10 3 10 2 10 1 Vertex Out Degree Distribution -α out 10 0 10 0 10 1 10 2 10 3 n(d in ) number of vertices 10 3 10 2 10 1 Vertex In Degree Distribution -α in 10 0 10 0 10 1 10 2 10 3 10 4 10 5 in degree, d in out degree, d out Graph represented as a rectangular sparse matrix Can be undirected, multi-edged, self-loops, disconnected, hyper edges, Out/in degree distributions are independent first order statistics Only constraint: Σ n(d out ) d out = Σ n(d in ) d in = M D4M- 5

Power Law Distribution Construction Perfect power law matlab code n 1 function [di ni] = PPL(alpha,dmax,Nd)! logdi = (0:Nd) * log(dmax) / Nd;! di = unique(round(exp(logdi)));! logni = alpha * (log(dmax) - log(di));! ni = round(exp(logni));! n(d i ) = n 1 /d i α Parameters alpha = slope dmax = largest degree vertex Nd = number of bins (before unique) 1 1 2 3 8 16 32 d max integer logarithmic Simple algorithm naturally generates perfect power law Smooth transition from integer to logarithmic bins Poor man s slope estimator: α = log(n 1 )/log(d max ) D4M- 6

Power Law Edge Construction Power law vertex list matlab code! function v = PowerLawEdges(di,ni);! A1 = sparse(1:numel(di),ni,di);! A2 = fliplr(cumsum(fliplr(a1),2));! [tmp tmp d] = find(a2);! A3 = sparse(1:numel(d),d,1);! A4 = fliplr(cumsum(fliplr(a3),2));! [v tmp tmp] = find(a4);! Degree distribution independent of Vertex labels Edge pairing Edge order random edge pairs random vertex labels Algorithm generates list of vertices corresponding to any distribution All other aspects of graph can be set based on desired properties D4M- 7

Fitting α, N, M Power law model works for any α > 0, d max > 1, N d > 1 10 4 Allowed N and M for α = 1.3 Desire distribution that fits α, N, M Can invert formulas N = Σ i n(d i ) M = Σ i n(d i ) d i M M 10 3 10 2 10 3 10 4 Highly non-linear; requires a combination of Exhaustive search, simulated annealing, and Broyden s algorithm Given α, N, M can solve for N d and d max Not all combinations of α, N, M are consistent with power law 10 2 N N D4M- 8

Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 9

Graph Construction Effects Generate a perfect power law NxN randomize adjacency matrix A α = 1.3, d max = 1000, N d = 50 N = 18K, M = 84K Make undirected, unweighted, with no self-loops A = triu(a + A );! A = double(logical(a));! A = A - diag(diag(a)); count 10 4 measured input model 10 3 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4 degree Graph theory best for undirected, unweighted graphs with no self-loops Often clean up real data to apply graph theory results Process mimics bent broom distribution seen in real data sets D4M- 10

Power Law Recovery Procedure Compute α, N, M from measured Fit perfect power law to these parameters Rebin measured data using perfect power law degree bins count 10 4 measured α 10 3 model fit rebin 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4 degree Perfect power law fit to cleaned up graph can recover much of the shape of the original distribution D4M- 11

Correlation Construction Effects Generate a perfect power law NxN randomize incidence matrix E α = 1.3, d max = 1000, N d = 50 N = 18K, M = 84K Make unweighted and use to form correlation matrix A with no self-loops count 10 4 measured input model 10 3 10 2 10 1 E = double(logical(e));! A = triu(e * E);! A = A - diag(diag(a)); 10 0 10 0 10 1 10 2 10 3 10 4 10 5 degree Correlation graph construction from incidence matrix results in a bent broom distribution that strongly resembles a power law D4M- 12

Power Law Lost Procedure Compute α, N, M from measured Fit perfect power law to these parameters Rebin measured data using perfect power law degree bins count 10 4 measured α model fit 10 3 rebin 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4 10 5 degree Perfect power law fit to correlation shows non-power law shape Reveals witches nose distribution D4M- 13

Power Law Preserved In degree is power law α = 1.3, d max = 1000, N d = 50 N = 18K, M = 84K Out degree is constant N = 16K, M = 84K Edges/row = 5 (exactly) count 10 4 measured α model fit rebin 10 3 10 2 10 1 Make unweighted and use to form correlation matrix A with no self-loops 10 0 10 0 10 1 10 2 10 3 10 4 degree Uniform distribution on correlated dimension preserves power law shape D4M- 14

Edge Ordering: Densification Compute M/N cumulatively and piecewise for 2 orderings Linear Random By definition M/N goes from 1 to infinity for finite N M/N 6 5 4 3 Linear Elimination of multi-edges reduces M and causes M/N to grow more slowly 2 random 1 0 0.2 0.4 0.6 0.8 1 fraction of total edges Densification is the observation that M/N increases with N Densification is a natural byproduct of randomly drawing edges from a power law distribution Linear ordering has constant M/N D4M- 15

Edge Ordering: Power Law Exponent (α) Compute α cumulatively and piecewise for 2 orderings Linear Random α Edge ordering and sampling have large effect on the power law exponent 2 1.8 1.6 1.4 1.2 random random cumulative linear cumulative linear 1 0 0.2 0.4 0.6 0.8 1 fraction of total edges Power law exponent is fundamental to distribution Strongly dependent on edge ordering and sample size D4M- 16

Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 17

Sub-Sampling Challenge Anomaly detection requires good estimates of background Traversing entire data sets to compute background counts is increasingly prohibitive Can be done at ingest, but often is not Can background be accurately estimated from a sub-sample of the entire data set? D4M- 18

Sampling a Power Law Whole distribution Generate power law Select fraction of edges 1/40 sample D4M- 19

Linear Degree Estimate Whole distribution Divide measured degree by fraction Accurate for high degree Overestimates low degree Can we do better? Linear estimate D4M- 20

Non-Linear Degree Estimate Whole distribution Assume power law input Create non-linear estimate Matches median degree Non-Linear estimate D4M- 21

Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 22

Reuter s Incidence Matrix Entities extracted from Reuter s Corpus E(i,j) = # times entity appeared in document LOCATION ORGANIZTION PERSON TIME 19960820_2304.txt 19960825_12989.txt 19960830_23589.txt 19960905_33974.txt 19960910_44444.txt 19960916_54917.txt 19960919_65253.txt 19960925_75717.txt DOCUMENT Ndoc = 797677 Nent = 47576 M = 6132286 19960930_86206.txt 19961004_96462.txt 19961009_106877.txt 19961015_117306.txt Four entity classes with different statistics 19961018_127737.txt E 19961023_138177.txt LOCATION ORGANZATION PERSON TIME 19961028_148557.txt 19961031_158890.txt 19961106_169358.txt 19961111_179616.txt 19970103_285928.txt 19970804_777356.txt LOCATION/addis LOCATION/el LOCATION/malaysia ababsalvado LOCATION/seychelles PERSON/abdullah PERSON/ashley PERSON/christian bad PERSON/drew giles PERSON/henry ma PERSON/jimmy peck PERSON/khaled wilson PERSON/michael hayes salam PERSON/paul PERSON/ryan pric wells PERSON/terrell beck TIME/1990 08 01 owen TIME/1996 08 29 TIME/1997 04 01 TIME/2006 02 28 TIME/2099 12 Fit power law model to each entity class D4M- 23

E(:,PERSON) Degree Distribution 10 6 E(:,PERSON) document distribution 10 5 measured α model fit rebin 10 5 E(:,PERSON) entity distribution 10 4 measured α model fit rebin 10 4 10 3 n(d doc ) 10 3 10 2 n(d ent ) 10 2 10 1 10 1 10 0 10 0 10 1 10 2 10 3 10 0 10 0 10 1 10 2 10 3 10 4 d doc d ent M N M/N α M fit N fit M fit /N fit Document 299333 170069 1.76 1.92 302478 170066 1.78 Entity 299333 37191 8.05 1.21 299748 37449 8.00 D4M- 24

Procedure Make unweighted and use to form correlation matrix A with no selfloops E(:,PERSON) t x E(:,PERSON) E = double(logical(e));! A = triu(e * E);! A = A - diag(diag(a)); Perfect power law fit to correlation shows non-power law shape Reveals witches nose distribution D4M- 25

Document Densification Constant M/N consistent with sequential ordering of documents D4M- 26

Entity Densification Increasing M/N consistent with random ordering of entities D4M- 27

Document Power Law Exponent (α) 2.8 2.6 location organization person time 2.4 2.2 α document 2 1.8 1.6 1.4 1.2 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 fraction of total edges Increasing α consistent with sequential ordering of documents D4M- 28

Entity Power Law Exponent (α) 2.5 2 location organization person time 1.5 α entity 1 0.5 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 fraction of total edges Decreasing α consistent with random ordering of entities D4M- 29

Summary Developed a background model for graphs based on perfect power law Can be done via simple heuristic Reproduces much of observed phenomena Examine effects of sampling such a power law Lossy, non-linear transformation of graph construction mirrors many observed phenomena Traditional sampling approaches significantly overestimate the probability of low degree vertices Assuming a power law distribution it is possible to construct a simple non-linear estimate that is more accurate Develop techniques for comparing real data with a power law model Can fit perfect power-law to observed data Provided binning for statistical tests D4M- 30

Acknowledgements Nicholas Arcolano Michelle Beard Nadya Bliss Bob Bond Matthew Schmidt Ben Miller Bill Arcand Bill Bergeron David Bestor Chansup Byun, Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee D4M- 31

D4M- 32 Appendix

Sub-Sampling Formula f = fraction of total edges sampled n 1 = # of vertices of degree 1 d max = maximum degree Allowed slope: ln(n 1 )/ln(d max /f) < α < ln(n 1 )/ln(d max ) Cumulative distribution P(α,d) = (f 1-α d max α / n 1 ) Σ i<d i 1-α e -fi Find α* such that P(α*, ) = 1 Find d 50% such that P(α*,d 50% ) = ½ Compute K = 1/(1 + ln(d 50% )/ln(f)) Non-linear estimate of true degree of vertex v from sample d(v) d(v) = d(v) / f1-1/(k d(v)) D4M- 33