Perfect Power Law Graphs: Generation, Sampling, Construction, and Fitting Jeremy Kepner SIAM Annual Meeting, Minneapolis, July 9, 2012 This work is sponsored by the Department of the Air Force under Air Force Contract #FA8721-05-C-0002. Opinions, interpretations, recommendations and conclusions are those of the authors and are not necessarily endorsed by the United States Government.
Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 2
Goals Develop a background model for graphs based on perfect power law Examine effects of sampling such a power law Develop techniques for comparing real data with a power law model D4M- 3
Detection Theory DETECTION OF SIGNAL IN NOISE THRESHOLD DETECTION OF SUBGRAPHS IN GRAPHS Example subgraph of interest: Fully connected (complete) SIGNAL NOISE SIGNAL H 0 H 1 NOISE N-D SPACE ASSUMPTIONS Background (noise) statistics Foreground (signal) statistics Foreground/background separation Model reality Example background model: Powerlaw graph Can we construct a background model based on power law degree distribution? D4M- 4
Perfect Power Law Matrix Definition N out Adjacency/Incidence Matrix N in A M = ΣA edges n(d out ) number of vertices 10 5 10 4 10 3 10 2 10 1 Vertex Out Degree Distribution -α out 10 0 10 0 10 1 10 2 10 3 n(d in ) number of vertices 10 3 10 2 10 1 Vertex In Degree Distribution -α in 10 0 10 0 10 1 10 2 10 3 10 4 10 5 in degree, d in out degree, d out Graph represented as a rectangular sparse matrix Can be undirected, multi-edged, self-loops, disconnected, hyper edges, Out/in degree distributions are independent first order statistics Only constraint: Σ n(d out ) d out = Σ n(d in ) d in = M D4M- 5
Power Law Distribution Construction Perfect power law matlab code n 1 function [di ni] = PPL(alpha,dmax,Nd)! logdi = (0:Nd) * log(dmax) / Nd;! di = unique(round(exp(logdi)));! logni = alpha * (log(dmax) - log(di));! ni = round(exp(logni));! n(d i ) = n 1 /d i α Parameters alpha = slope dmax = largest degree vertex Nd = number of bins (before unique) 1 1 2 3 8 16 32 d max integer logarithmic Simple algorithm naturally generates perfect power law Smooth transition from integer to logarithmic bins Poor man s slope estimator: α = log(n 1 )/log(d max ) D4M- 6
Power Law Edge Construction Power law vertex list matlab code! function v = PowerLawEdges(di,ni);! A1 = sparse(1:numel(di),ni,di);! A2 = fliplr(cumsum(fliplr(a1),2));! [tmp tmp d] = find(a2);! A3 = sparse(1:numel(d),d,1);! A4 = fliplr(cumsum(fliplr(a3),2));! [v tmp tmp] = find(a4);! Degree distribution independent of Vertex labels Edge pairing Edge order random edge pairs random vertex labels Algorithm generates list of vertices corresponding to any distribution All other aspects of graph can be set based on desired properties D4M- 7
Fitting α, N, M Power law model works for any α > 0, d max > 1, N d > 1 10 4 Allowed N and M for α = 1.3 Desire distribution that fits α, N, M Can invert formulas N = Σ i n(d i ) M = Σ i n(d i ) d i M M 10 3 10 2 10 3 10 4 Highly non-linear; requires a combination of Exhaustive search, simulated annealing, and Broyden s algorithm Given α, N, M can solve for N d and d max Not all combinations of α, N, M are consistent with power law 10 2 N N D4M- 8
Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 9
Graph Construction Effects Generate a perfect power law NxN randomize adjacency matrix A α = 1.3, d max = 1000, N d = 50 N = 18K, M = 84K Make undirected, unweighted, with no self-loops A = triu(a + A );! A = double(logical(a));! A = A - diag(diag(a)); count 10 4 measured input model 10 3 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4 degree Graph theory best for undirected, unweighted graphs with no self-loops Often clean up real data to apply graph theory results Process mimics bent broom distribution seen in real data sets D4M- 10
Power Law Recovery Procedure Compute α, N, M from measured Fit perfect power law to these parameters Rebin measured data using perfect power law degree bins count 10 4 measured α 10 3 model fit rebin 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4 degree Perfect power law fit to cleaned up graph can recover much of the shape of the original distribution D4M- 11
Correlation Construction Effects Generate a perfect power law NxN randomize incidence matrix E α = 1.3, d max = 1000, N d = 50 N = 18K, M = 84K Make unweighted and use to form correlation matrix A with no self-loops count 10 4 measured input model 10 3 10 2 10 1 E = double(logical(e));! A = triu(e * E);! A = A - diag(diag(a)); 10 0 10 0 10 1 10 2 10 3 10 4 10 5 degree Correlation graph construction from incidence matrix results in a bent broom distribution that strongly resembles a power law D4M- 12
Power Law Lost Procedure Compute α, N, M from measured Fit perfect power law to these parameters Rebin measured data using perfect power law degree bins count 10 4 measured α model fit 10 3 rebin 10 2 10 1 10 0 10 0 10 1 10 2 10 3 10 4 10 5 degree Perfect power law fit to correlation shows non-power law shape Reveals witches nose distribution D4M- 13
Power Law Preserved In degree is power law α = 1.3, d max = 1000, N d = 50 N = 18K, M = 84K Out degree is constant N = 16K, M = 84K Edges/row = 5 (exactly) count 10 4 measured α model fit rebin 10 3 10 2 10 1 Make unweighted and use to form correlation matrix A with no self-loops 10 0 10 0 10 1 10 2 10 3 10 4 degree Uniform distribution on correlated dimension preserves power law shape D4M- 14
Edge Ordering: Densification Compute M/N cumulatively and piecewise for 2 orderings Linear Random By definition M/N goes from 1 to infinity for finite N M/N 6 5 4 3 Linear Elimination of multi-edges reduces M and causes M/N to grow more slowly 2 random 1 0 0.2 0.4 0.6 0.8 1 fraction of total edges Densification is the observation that M/N increases with N Densification is a natural byproduct of randomly drawing edges from a power law distribution Linear ordering has constant M/N D4M- 15
Edge Ordering: Power Law Exponent (α) Compute α cumulatively and piecewise for 2 orderings Linear Random α Edge ordering and sampling have large effect on the power law exponent 2 1.8 1.6 1.4 1.2 random random cumulative linear cumulative linear 1 0 0.2 0.4 0.6 0.8 1 fraction of total edges Power law exponent is fundamental to distribution Strongly dependent on edge ordering and sample size D4M- 16
Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 17
Sub-Sampling Challenge Anomaly detection requires good estimates of background Traversing entire data sets to compute background counts is increasingly prohibitive Can be done at ingest, but often is not Can background be accurately estimated from a sub-sample of the entire data set? D4M- 18
Sampling a Power Law Whole distribution Generate power law Select fraction of edges 1/40 sample D4M- 19
Linear Degree Estimate Whole distribution Divide measured degree by fraction Accurate for high degree Overestimates low degree Can we do better? Linear estimate D4M- 20
Non-Linear Degree Estimate Whole distribution Assume power law input Create non-linear estimate Matches median degree Non-Linear estimate D4M- 21
Outline Introduction Sampling Sub-sampling Reuter s Data Summary D4M- 22
Reuter s Incidence Matrix Entities extracted from Reuter s Corpus E(i,j) = # times entity appeared in document LOCATION ORGANIZTION PERSON TIME 19960820_2304.txt 19960825_12989.txt 19960830_23589.txt 19960905_33974.txt 19960910_44444.txt 19960916_54917.txt 19960919_65253.txt 19960925_75717.txt DOCUMENT Ndoc = 797677 Nent = 47576 M = 6132286 19960930_86206.txt 19961004_96462.txt 19961009_106877.txt 19961015_117306.txt Four entity classes with different statistics 19961018_127737.txt E 19961023_138177.txt LOCATION ORGANZATION PERSON TIME 19961028_148557.txt 19961031_158890.txt 19961106_169358.txt 19961111_179616.txt 19970103_285928.txt 19970804_777356.txt LOCATION/addis LOCATION/el LOCATION/malaysia ababsalvado LOCATION/seychelles PERSON/abdullah PERSON/ashley PERSON/christian bad PERSON/drew giles PERSON/henry ma PERSON/jimmy peck PERSON/khaled wilson PERSON/michael hayes salam PERSON/paul PERSON/ryan pric wells PERSON/terrell beck TIME/1990 08 01 owen TIME/1996 08 29 TIME/1997 04 01 TIME/2006 02 28 TIME/2099 12 Fit power law model to each entity class D4M- 23
E(:,PERSON) Degree Distribution 10 6 E(:,PERSON) document distribution 10 5 measured α model fit rebin 10 5 E(:,PERSON) entity distribution 10 4 measured α model fit rebin 10 4 10 3 n(d doc ) 10 3 10 2 n(d ent ) 10 2 10 1 10 1 10 0 10 0 10 1 10 2 10 3 10 0 10 0 10 1 10 2 10 3 10 4 d doc d ent M N M/N α M fit N fit M fit /N fit Document 299333 170069 1.76 1.92 302478 170066 1.78 Entity 299333 37191 8.05 1.21 299748 37449 8.00 D4M- 24
Procedure Make unweighted and use to form correlation matrix A with no selfloops E(:,PERSON) t x E(:,PERSON) E = double(logical(e));! A = triu(e * E);! A = A - diag(diag(a)); Perfect power law fit to correlation shows non-power law shape Reveals witches nose distribution D4M- 25
Document Densification Constant M/N consistent with sequential ordering of documents D4M- 26
Entity Densification Increasing M/N consistent with random ordering of entities D4M- 27
Document Power Law Exponent (α) 2.8 2.6 location organization person time 2.4 2.2 α document 2 1.8 1.6 1.4 1.2 1 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 fraction of total edges Increasing α consistent with sequential ordering of documents D4M- 28
Entity Power Law Exponent (α) 2.5 2 location organization person time 1.5 α entity 1 0.5 0 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 fraction of total edges Decreasing α consistent with random ordering of entities D4M- 29
Summary Developed a background model for graphs based on perfect power law Can be done via simple heuristic Reproduces much of observed phenomena Examine effects of sampling such a power law Lossy, non-linear transformation of graph construction mirrors many observed phenomena Traditional sampling approaches significantly overestimate the probability of low degree vertices Assuming a power law distribution it is possible to construct a simple non-linear estimate that is more accurate Develop techniques for comparing real data with a power law model Can fit perfect power-law to observed data Provided binning for statistical tests D4M- 30
Acknowledgements Nicholas Arcolano Michelle Beard Nadya Bliss Bob Bond Matthew Schmidt Ben Miller Bill Arcand Bill Bergeron David Bestor Chansup Byun, Matt Hubbell Pete Michaleas Julie Mullen Andy Prout Albert Reuther Tony Rosa Charles Yee D4M- 31
D4M- 32 Appendix
Sub-Sampling Formula f = fraction of total edges sampled n 1 = # of vertices of degree 1 d max = maximum degree Allowed slope: ln(n 1 )/ln(d max /f) < α < ln(n 1 )/ln(d max ) Cumulative distribution P(α,d) = (f 1-α d max α / n 1 ) Σ i<d i 1-α e -fi Find α* such that P(α*, ) = 1 Find d 50% such that P(α*,d 50% ) = ½ Compute K = 1/(1 + ln(d 50% )/ln(f)) Non-linear estimate of true degree of vertex v from sample d(v) d(v) = d(v) / f1-1/(k d(v)) D4M- 33