Spectral Methods for Subgraph Detection


Spectral Methods for Subgraph Detection
Nadya T. Bliss & Benjamin A. Miller, Embedded and High Performance Computing
Patrick J. Wolfe, Statistics and Information Laboratory, Harvard University
SIAM, 12 July 2010

This work is sponsored by the Department of the Air Force under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions and recommendations are those of the author and are not necessarily endorsed by the United States Government.

DISTRIBUTION STATEMENT A: Approved for public release; distribution is unlimited.

Outline
- Introduction
- Approach
- Handling Large Graphs
- Summary

Application Examples

Wide variety of application domains:
- Social network analysis: relationships between people
- Biology: interactions between proteins
- Signal/image processing: discrimination and classification
- Computer networks: failure detection

Example tasks on a social network: detect anomalies (detection) and identify the actors, i.e., individuals, involved (identification).

Subgraph Detection Problem

Goal: develop a detection framework for finding subgraphs of interest in large graphs.

Hypotheses, by analogy with thresholded signal-plus-noise detection:
- H0: G = Gb (background graph only; noise)
- H1: G = Gb + Gf (background plus foreground graph; signal plus noise)

Detection problem: given G, is H0 or H1 true?

Graph detection challenges: background/foreground models, non-Euclidean data, high-dimensional space.

H0: Background Graph -Power Law-

- Power-law degree distribution: many vertices with 10-20 edges, few vertices with more than 40 edges
- Example: A, the adjacency matrix of a 1024-vertex power-law graph G, and its degree distribution
- Real-world graphs exhibit power-law properties, and well-defined generators exist
- The structural complexity of such backgrounds presents a challenge for detection
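A heavy-tailed background like this can be simulated with a simple expected-degrees (Chung-Lu) generator. This is a generic sketch, not the talk's generator (the deck mentions R-MAT and Erdős-Rényi elsewhere); the function name and parameter choices are illustrative.

```python
import numpy as np

def chung_lu_adjacency(weights, rng):
    """Chung-Lu random graph: edge (i, j) appears independently with
    probability w_i * w_j / sum(w), capped at 1 (illustrative helper)."""
    w = np.asarray(weights, dtype=float)
    p = np.minimum(np.outer(w, w) / w.sum(), 1.0)
    upper = np.triu(rng.random((len(w), len(w))) < p, k=1)
    return (upper | upper.T).astype(int)

rng = np.random.default_rng(0)
# Heavy-tailed expected degrees: many small, a few large
weights = (rng.pareto(2.0, size=256) + 1.0) * 3.0
A = chung_lu_adjacency(weights, rng)
```

The resulting A is a symmetric 0/1 matrix with an empty diagonal, matching the adjacency-matrix convention used throughout the deck.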

H1: Background Graph + Foreground Graph -Dense Subgraph in Power Law Graph-

- Signal (target signature): a dense subgraph Gf, embedded on randomly selected vertices of Gb to form Gb + Gf
- Realistic scenario, with the subgraph connected to the background
- Well controlled but challenging example that allows rigorous analysis
- Some subgraphs of interest exhibit high density

M. Skipper, "Network biology: A protein network of one's own proteins," Nature Reviews Molecular Cell Biology 6, 824 (November 2005).
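The H1 embedding can be sketched as overlaying a dense block on randomly chosen background vertices. The helper name `embed_dense_subgraph` and its arguments are hypothetical, not from the talk.

```python
import numpy as np

def embed_dense_subgraph(A, nf, density, rng):
    """Overlay a dense Erdos-Renyi block G_f on nf randomly selected
    vertices of the background adjacency A (hypothetical helper)."""
    n = A.shape[0]
    verts = rng.choice(n, size=nf, replace=False)
    block = np.triu(rng.random((nf, nf)) < density, k=1)
    block = (block | block.T).astype(A.dtype)
    A_embedded = A.copy()
    idx = np.ix_(verts, verts)
    # Union of background and foreground edges on the chosen vertices
    A_embedded[idx] = np.maximum(A_embedded[idx], block)
    return A_embedded, verts

rng = np.random.default_rng(1)
Gb = np.zeros((1024, 1024), dtype=int)   # stand-in background graph
G, verts = embed_dense_subgraph(Gb, 12, 0.9, rng)
```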

Outline
- Introduction
- Approach
- Handling Large Graphs
- Summary

Graph-Based Residuals Analysis

Linear regression:
- Least-squares residuals from a best-fit line
- Analysis of variance (ANOVA) describes the fit
- Explained vs. unexplained variance enables signal/noise discrimination

Graph regression (analysis of modularity):
- Residuals from a best-fit graph model
- Analysis of variance from the expected topology
- Unexplained variance in the graph residuals enables subgraph detection

Overview

The processing chain for subgraph detection is analogous to a traditional signal processing chain:

MODULARITY MATRIX CONSTRUCTION → EIGENDECOMPOSITION → COMPONENT SELECTION → DETECTION → IDENTIFICATION

- Input: A, the adjacency matrix representation of G (no cue)
- Output: v_s, the set of vertices identified as belonging to the subgraph Gf

Modularity Matrix* Construction

B = A − (1/2M) KKᵀ

(adjacency matrix minus the expected number of edges, where K is the degree vector and M is the number of edges; in the slide's 7-vertex example graph, K = [2 3 3 3 3 2 4]ᵀ and 2M = 20)

- Commonly used to evaluate the quality of a division of a graph into communities
- Application to subgraph detection: target signatures have connectivity patterns distinct from the background, so a target embedding can be viewed as the creation of a community

* M. E. J. Newman, "Finding community structure in networks using the eigenvectors of matrices," Phys. Rev. E 74, 036104 (2006).
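Newman's modularity matrix can be computed in a few lines; this sketch assumes a dense NumPy adjacency matrix (the slide's 7-vertex example graph is not fully recoverable from the transcription, so a small arbitrary graph is used instead).

```python
import numpy as np

def modularity_matrix(A):
    """B = A - K K^T / (2M): observed adjacency minus the expected
    number of edges under a degree-preserving random-graph model."""
    K = A.sum(axis=1).astype(float)   # degree vector
    two_m = K.sum()                   # 2M = sum of degrees
    return A - np.outer(K, K) / two_m

# Small arbitrary example graph
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
B = modularity_matrix(A)
```

Every row (and column) of B sums to zero, which is a quick sanity check on an implementation.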

Eigendecomposition

B = UDUᵀ, with eigenvalues λ_1, λ_2, ..., λ_N sorted by magnitude and corresponding eigenvectors u_1, u_2, ..., u_N (the principal components).

- Projecting onto the principal components of the modularity matrix yields good separation between background and foreground
- Example (coloring based on known truth): 1024 vertices in G, 12 in Gf; each point in the projection represents a vertex, giving uncued background/foreground separation
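A sketch of this projection step: take the eigenvectors of B with the largest-magnitude eigenvalues and use them as 2-D coordinates for each vertex. The graph here is synthetic, and "top two by magnitude" is a simplification of the deck's component-selection stage.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 128
# Synthetic symmetric 0/1 adjacency matrix
A = np.triu(rng.random((n, n)) < 0.05, k=1)
A = (A | A.T).astype(float)

K = A.sum(axis=1)
B = A - np.outer(K, K) / K.sum()     # modularity matrix

evals, evecs = np.linalg.eigh(B)     # B is symmetric
order = np.argsort(-np.abs(evals))   # sort eigenvalues by magnitude
coords = evecs[:, order[:2]]         # one 2-D point per vertex
```

Scatter-plotting `coords` (coloring foreground vertices differently) reproduces the kind of separation plot shown on the slide.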

Detection -Power-Law Background, 12-Vertex Dense Subgraph-

- Test statistic: based on the symmetry of the projection onto the selected components
- Multiple trials run with Gb only (H0) and with Gb + Gf (H1)
- The resulting H0 and H1 distributions of the test statistic are well separated
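The deck does not spell out the exact symmetry statistic. One plausible stand-in, shown purely for illustration (not the authors' statistic), measures how far the entries of a selected component are from being symmetric about zero; a foreground embedding tends to pull a few vertices far out on one side.

```python
import numpy as np

def asymmetry_statistic(u):
    """Illustrative statistic: compare the largest positive and
    negative excursions of the eigenvector entries; a symmetric
    background scatter gives a value near zero."""
    u = np.asarray(u, dtype=float)
    return abs(u.max() + u.min())

# A symmetric cloud scores low; a one-sided outlier scores high
background = np.array([-0.3, -0.1, 0.0, 0.1, 0.3])
with_target = np.array([-0.1, 0.0, 0.1, 0.8, 0.9])
```

Repeating this over many Gb-only and Gb + Gf trials produces the H0 and H1 histograms sketched on the slide.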

Distribution of Test Statistics

- Under H0, the test statistic is well fit by a gamma distribution: k = 2, θ = 4.62 for an R-MAT background, and k = 2, θ = 1.91 for an Erdős-Rényi background
- Embedding a 12-vertex fully connected subgraph significantly changes the test statistic for both background models
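Gamma fits like these can be reproduced mechanically with SciPy's maximum-likelihood fitter. The samples below are synthetic draws from the slide's R-MAT parameters, not the actual test statistics.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Synthetic H0 statistics drawn from the slide's R-MAT fit (k=2, theta=4.62)
samples = rng.gamma(shape=2.0, scale=4.62, size=5000)

# Maximum-likelihood gamma fit with the location pinned to zero
k_hat, _, theta_hat = stats.gamma.fit(samples, floc=0)
```

With 5000 samples the recovered shape and scale land close to the generating parameters, so the same procedure applied to real H0 trials yields the slide's fitted curves.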

Detection Performance -Power Law Background, 12-Vertex Dense Subgraph-

- A detection is positive when G is declared to contain Gf
- Pd (true positive rate): TPR = TP / P, the fraction of all positives identified as positives
- Pfa (false positive rate): FPR = FP / N, the fraction of all negatives identified as positives
- The detector characteristic is varied by sweeping the detection threshold; performance also depends on subgraph density
- Result: reliable, uncued detection of tightly connected groups
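Pd/Pfa curves like the one behind this slide come from sweeping the detection threshold over the H0 and H1 test-statistic samples; a generic sketch:

```python
import numpy as np

def roc_curve(h0_stats, h1_stats):
    """Sweep a threshold over the pooled statistics; at each threshold,
    TPR = fraction of H1 trials flagged and FPR = fraction of H0 trials
    flagged, declaring H1 whenever the statistic meets the threshold."""
    thresholds = np.unique(np.concatenate([h0_stats, h1_stats]))
    tpr = np.array([(h1_stats >= t).mean() for t in thresholds])
    fpr = np.array([(h0_stats >= t).mean() for t in thresholds])
    return fpr, tpr

# Toy, well-separated statistics (as in the H0/H1 histograms above)
h0 = np.array([0.1, 0.2, 0.3, 0.4])
h1 = np.array([0.8, 0.9, 1.0, 1.1])
fpr, tpr = roc_curve(h0, h1)
```

When the H0 and H1 distributions are well separated, some threshold achieves Pd = 1 at Pfa = 0, which is the regime the slide reports.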

Outline
- Introduction
- Approach
- Handling Large Graphs
- Summary

Eigendecomposition of the Modularity Matrix -Revisited-

B = UDUᵀ = A − (1/2M) KKᵀ, with eigenvalues sorted by magnitude and corresponding eigenvectors u_1, ..., u_N.

- B is dense and thus cannot be stored for large graphs
- Solution: compute the eigenvectors without storing B in memory
- Approach: create a function that accepts a vector x and returns Bx without computing B, then compute the eigenvectors of this function

Computing Eigenvectors of Large Graphs

Bx can be computed without forming B:

Bx = Ax − K(Kᵀx)/(2M)

- Multiplication by B is expressed as multiplication by a sparse matrix A, plus a vector dot product and a scalar-vector product
- Costs: a dense matrix-vector product would be O(|V|²); instead, the sparse matrix-vector product Ax is O(|E|), and the dot product Kᵀx and the scalar-vector product are each O(|V|)
- This method is both space- and time-efficient
- The eigenvectors of f(x) = Ax − K(Kᵀx)/(2M) are the eigenvectors of B
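This matrix-free trick maps directly onto SciPy's `LinearOperator` interface; a sketch with a synthetic sparse background standing in for a large graph (the 1/(2M) factor follows the modularity-matrix definition used here):

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, eigsh

n = 400
# Sparse symmetric 0/1 adjacency matrix standing in for a large graph
upper = sp.triu(sp.random(n, n, density=0.02, random_state=4), k=1)
upper.data[:] = 1.0
A = (upper + upper.T).tocsr()

K = np.asarray(A.sum(axis=1)).ravel()   # degree vector
two_m = K.sum()                         # 2M = sum of degrees

def apply_B(x):
    # Bx = Ax - K (K^T x) / (2M): one sparse matvec (O(|E|)), one dot
    # product and one scaled vector (O(|V|)); dense B is never formed.
    return A @ x - K * (K @ x) / two_m

B_op = LinearOperator((n, n), matvec=apply_B, dtype=float)
evals, evecs = eigsh(B_op, k=2, which="LM")   # two largest-magnitude
```

The same pattern (an ARPACK-style iterative eigensolver fed a matvec callback) is how MATLAB's `eigs` can be driven without storing B, as in the Epinions example later in the deck.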

Detection Performance -Large Graphs-

- Scenario: power-law background with 2^20 vertices; foreground of 35 vertices, 90% dense
- The spectral subgraph detection algorithm can be optimized by exploiting matrix properties
- Analysis of a 2^20-vertex graph can be performed in minutes (~10) on a single laptop

Detectability -With Increasing Background Size-

The algorithm exhibits the desired behavior: as the size of the background graph increases, the minimum detectable subgraph size remains small.

Epinions Data Analysis -Large Graph Example-

- Who-trusts-whom network from the Epinions consumer review site: 75,879 vertices, 405,740 edges
- The modularity matrix is too large to store in memory, so the eigenvectors of f(x) = Ax − K(Kᵀx)/(2M) are computed instead
- 200 eigenvectors computed in 155 seconds using MATLAB
- Projection onto the 36th and 45th largest eigenvectors reveals densely connected clusters

Summary

- Subgraph detection is an important problem
- A detection framework for graphs enables the development of algorithms and metrics
- Results on simulated and real datasets demonstrate the power of the approach: good detection performance, extended to very large graphs
- Understanding the background statistics (the noise and clutter model) is of key importance
- Current research: weak-signature foregrounds and detection of subgraph formation