Using Local Spectral Methods in Theory and in Practice

Michael W. Mahoney, ICSI and Dept. of Statistics, UC Berkeley. (For more info, see http://www.stat.berkeley.edu/~mmahoney or Google on Michael Mahoney.) December 2015. 1/33

Global spectral methods. Given the Laplacian L of a graph G = (V, E), solve the following:

minimize x^T L x
subject to x^T D x = 1, x^T D 1 = 0.

Good News: Things one can prove about this. The solution can be found by computing an eigenvector of L. The solution can be found by running a random walk to convergence. The solution is quadratically good (i.e., Cheeger's Inequality). The solution can be used for clustering, classification, ranking, etc.
Bad News: This is a very global thing and so often not useful. Can we localize it in some meaningful way? 2/33
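
To make the "compute an eigenvector" remark concrete, here is a minimal sketch (mine, not code from the talk) that solves this program for a small dense adjacency matrix: change variables y = D^{1/2} x, take the second-smallest eigenvector of the normalized Laplacian, and undo the change of variables.

```python
import numpy as np

def global_spectral_vector(A):
    """Solve: minimize x'Lx subject to x'Dx = 1, x'D1 = 0,
    for a connected graph with dense symmetric adjacency matrix A."""
    d = A.sum(axis=1)                       # degree vector
    D_isqrt = np.diag(1.0 / np.sqrt(d))
    L = np.diag(d) - A                      # combinatorial Laplacian
    L_norm = D_isqrt @ L @ D_isqrt          # normalized Laplacian
    vals, vecs = np.linalg.eigh(L_norm)     # eigenvalues in ascending order
    y = vecs[:, 1]                          # second-smallest eigenvector
    x = D_isqrt @ y                         # undo the y = D^{1/2} x change of variables
    return x / np.sqrt(x @ (d * x))         # enforce x'Dx = 1

# Toy example: two triangles joined by a single edge.  Sorting the entries
# of x gives the ordering used for the usual Cheeger sweep cut.
A = np.zeros((6, 6))
for i, j in [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]:
    A[i, j] = A[j, i] = 1.0
print(np.argsort(global_spectral_vector(A)))
```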

Outline
Motivation 1: Social and information networks
Motivation 2: Machine learning data graphs
Local spectral methods in worst-case theory
Local spectral methods to robustify graph construction

Networks and networked data. Lots of networked data!
technological networks (AS, power-grid, road networks)
biological networks (food-web, protein networks)
social networks (collaboration networks, friendships)
information networks (co-citation, blog cross-postings, advertiser-bidded phrase graphs, ...)
language networks (semantic networks, ...)
Interaction graph model of networks: nodes represent entities; edges represent interactions between pairs of entities. 4/33

Three different types of real networks. Figure: (a) NCP: conductance value of the best-conductance set, as a function of size; (b) CRP: ratio of internal to external conductance, as a function of size; (c) CA-GrQc; (d) FB-Johns55; (e) US-Senate. 5/33

Information propagates local-to-global in different networks in different ways. 6/33

Outline
Motivation 1: Social and information networks
Motivation 2: Machine learning data graphs
Local spectral methods in worst-case theory
Local spectral methods to robustify graph construction

Use case: Galactic spectra from SDSS. x_i ∈ R^{3841}, N ≈ 500k: photon fluxes in 10 Å wavelength bins; preprocessing corrects for redshift and gappy regions; normalized by median flux at certain wavelengths. Figure panels: raw spectrum in observed frame; raw spectrum in rest frame; gap-corrected spectrum in rest frame (flux vs. wavelength, 3000-9000 Å). (Footnote: also results in neuroscience, as well as genetics and mass-spec imaging.) 8/33

Red vs. blue galaxies. Figure: (a) red galaxy image; (b) blue galaxy image; (c) red galaxy spectrum; (d) blue galaxy spectrum. 9/33

Dimension reduction in astronomy. An incomplete history:
(Connolly et al., 1995): principal components analysis
(Vanderplas & Connolly, 2009): locally linear embedding
(Richards et al., 2009): diffusion maps regression
(Yip, Mahoney, et al., 2013): CUR low-rank decompositions
Here: apply global/local spectral methods to astronomical spectra; address both nonlinear structure and nonuniform density; explore locally-biased versions of these embeddings for a downstream classification task. 10/33

Constructing global diffusion embeddings (Belkin & Niyogi, 2003; Coifman & Lafon, 2006). Given data {x_j}_{j=1}^N, form a graph with edge weights

W_ij = exp( -||x_i - x_j||^2 / (ε_i ε_j) ),   D_ii = Σ_j W_ij.

In practice, only add k-nearest-neighbor edges, and set ε_i = distance to point i's (k/2)-th nearest neighbor. The lazy transition matrix is M = (1/2) D^{-1/2} (D + W) D^{-1/2}. The embedding is given by the leading non-trivial eigenvectors of M. 11/33
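
A sketch of this construction in Python (my own; the choice of k, the bandwidth rule, and the use of eigsh are consistent with the slide but not code from the talk):

```python
import numpy as np
from scipy.spatial import cKDTree
from scipy.sparse import coo_matrix, identity, diags
from scipy.sparse.linalg import eigsh

def diffusion_embedding(X, k=16, n_components=2):
    """Lazy diffusion-map embedding of the rows of X (n points, d features)."""
    n = X.shape[0]
    tree = cKDTree(X)
    dist, idx = tree.query(X, k=k + 1)           # neighbor 0 is the point itself
    eps = dist[:, k // 2]                        # self-tuning bandwidth per point
    rows = np.repeat(np.arange(n), k)
    cols = idx[:, 1:].ravel()
    w = np.exp(-dist[:, 1:].ravel() ** 2 / (eps[rows] * eps[cols]))
    W = coo_matrix((w, (rows, cols)), shape=(n, n)).tocsr()
    W = 0.5 * (W + W.T)                          # symmetrize the kNN graph
    d = np.asarray(W.sum(axis=1)).ravel()
    D_isqrt = diags(1.0 / np.sqrt(d))
    M = 0.5 * (identity(n) + D_isqrt @ W @ D_isqrt)   # lazy Markov operator (symmetric form)
    vals, vecs = eigsh(M, k=n_components + 1, which='LA')
    order = np.argsort(-vals)                    # leading eigenvalues first
    return vecs[:, order[1:n_components + 1]]    # drop the trivial top eigenvector
```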

Global embedding: effect of k. Figure: eigenvectors 3 and 4 of the lazy Markov operator, for k ranging from 2 to 2048. 12/33

Global embedding: average spectra. Figure: average spectra for groups A1-A5, E1-E5, and R1-R5 (scaled flux + offset vs. wavelength in 10^-10 m). 13/33

Optimization approach to global spectral methods. The Markov matrix is related to the combinatorial graph Laplacian L:

L := D - W = 2 D^{1/2} (I - M) D^{1/2}.

Can write v_2, the first nontrivial eigenvector of L, as the solution to

minimize x^T L x
subject to x^T D x = 1, x^T D 1 = 0.

Similarly for v_t, with additional constraints x^T D v_j = 0, j < t.

Theorem. The solution can be found by computing an eigenvector. It is quadratically good (i.e., Cheeger's Inequality). 14/33

MOV optimization approach to local spectral methods (Mahoney, Orecchia, and Vishnoi, 2009; Hansen and Mahoney, 2013; Lawlor, Budavari, and Mahoney, 2015). Suppose we have: 1. a seed vector s = χ_S, where S is a subset of data points; 2. a correlation parameter κ.

MOV objective. The first semi-supervised eigenvector w_2 solves:

minimize x^T L x
subject to x^T D x = 1, x^T D 1 = 0, x^T D s ≥ √κ.

Similarly for w_t, with additional constraints x^T D w_j = 0, j < t.

Theorem. The solution can be found by solving a linear equation. It is quadratically good (with a local version of Cheeger's Inequality). 15/33

Local embedding: scale parameter and effect of seed. For an appropriate choice of c and γ = γ(κ) < λ_2, one can show

w_2 = c (L - γ D)^+ D s = c (L_G - γ L_{K_n})^+ D s.

(In practice, binary search to find the correct γ.)

Figure: (left) Global embedding with seeds in black. (middle, right) Local embeddings using the specified seeds. 16/33
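
A hedged sketch of how one might compute w_2 in practice: solve the linear system for a trial γ and binary-search γ until the seed correlation matches √κ. The variable names, tolerances, and the projection of the seed against the all-ones vector are my own choices, consistent with the MOV setup but not code from the talk.

```python
import numpy as np

def mov_vector(L, d, s, kappa, lam2, tol=1e-4, iters=60):
    """Sketch of the MOV local spectral vector w = c (L_G - gamma L_{K_n})^+ D s,
    with gamma < lambda_2 found by binary search so that the D-correlation of w
    with the seed is approximately sqrt(kappa).
    L: dense combinatorial Laplacian; d: degree vector; s: seed indicator vector;
    lam2: second-smallest generalized eigenvalue of (L, D)."""
    vol = d.sum()
    L_Kn = np.diag(d) - np.outer(d, d) / vol      # Laplacian of the weighted complete graph K_n
    s = s - (d @ s) / vol                         # make the seed D-orthogonal to the all-ones vector
    s = s / np.sqrt(s @ (d * s))                  # D-normalize the seed
    Ds = d * s

    lo, hi = -1e4, lam2                           # gamma is searched in (-inf, lambda_2)
    w = None
    for _ in range(iters):
        gamma = 0.5 * (lo + hi)
        w = np.linalg.pinv(L - gamma * L_Kn) @ Ds
        w = w / np.sqrt(w @ (d * w))              # enforce w' D w = 1
        corr = w @ Ds                             # D-correlation with the seed
        if abs(corr - np.sqrt(kappa)) < tol:
            break
        if corr > np.sqrt(kappa):
            lo = gamma                            # too localized: push gamma up toward lambda_2
        else:
            hi = gamma                            # not localized enough: pull gamma down
    return w
```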

Classification on global and local embeddings. Try to reproduce 5 astronomer-defined classes. Train multiclass logistic regression on the global and local embeddings. Seeds are chosen to discriminate one class (e.g., AGN) vs. the rest. Figure: (left) Global embedding, colored by class. (middle, right) Confusion matrices for classification on the global and local embeddings. 17/33
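
For concreteness, a small sketch of the downstream classification step with scikit-learn (the train/test split and solver settings are mine, not from the talk):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

def classify_on_embedding(Z, labels):
    """Z: (n_points, n_embedding_dims) global or local embedding coordinates;
    labels: integer class labels (e.g., the 5 astronomer-defined classes).
    Returns the confusion matrix on a held-out split."""
    Z_tr, Z_te, y_tr, y_te = train_test_split(Z, labels, test_size=0.3, random_state=0)
    clf = LogisticRegression(max_iter=1000)      # multinomial for multiclass labels
    clf.fit(Z_tr, y_tr)
    return confusion_matrix(y_te, clf.predict(Z_te))
```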

Outline
Motivation 1: Social and information networks
Motivation 2: Machine learning data graphs
Local spectral methods in worst-case theory
Local spectral methods to robustify graph construction

Local versus global spectral methods.
Global spectral methods:
Compute eigenvectors of matrices related to the graph.
Provide quadratically good approximations to the best partitioning of the graph (Cheeger's Inequality).
This provides inference control for classification, regression, etc.
Local spectral methods:
Use random walks to find locally-biased partitions in large graphs.
Can prove locally-biased versions of Cheeger's Inequality.
Scalable worst-case running time; non-trivial statistical properties.
Success stories for local spectral methods:
Getting nearly-linear-time Laplacian-based linear solvers.
Finding local clusters in very large graphs.
Analyzing large social and information graphs.
19/33

Two different types of local spectral methods.
Strongly-local methods (ST; ACL; C; AP): run short random walks.
Theorem: If there is a small cluster near the seed node, then you will find it; otherwise you will stop. The running time depends on the size of the output, not on the size of the graph.
You don't even touch most of the nodes in a large graph.
Very good in practice, especially the ACL push algorithm.
Weakly-local methods (MOV; HM): optimization objectives with locality constraints.
Theorem: If there is a small cluster near the seed node, then you will find it; otherwise you will stop. The running time depends on the time to solve linear systems.
You do touch all of the nodes in a large graph.
Many semi-supervised learning methods have a similar form.
20/33

The ACL push procedure

1. x^(1) = 0, r^(1) = (1 - β) e_i, k = 1
2. while any r_j > τ d_j   (d_j is the degree of node j)
3.   x^(k+1) = x^(k) + (r_j - τ d_j ρ) e_j
4.   r_i^(k+1) = τ d_j ρ if i = j;  r_i^(k) + β (r_j - τ d_j ρ) / d_j if i ~ j;  r_i^(k) otherwise
5.   k ← k + 1

Things to note. This approximates the solution to the personalized PageRank problem:
(I - β A D^{-1}) x = (1 - β) v;
equivalently, (I - β D^{-1/2} A D^{-1/2}) y = (1 - β) D^{-1/2} v, with x = D^{1/2} y;
equivalently, [α D + L] z = α v, where β = 1/(1 + α) and x = D z.
The global invariant r = (1 - β) v - (I - β A D^{-1}) x is maintained throughout, even though r and x are supported locally.
Question: What does this algorithm compute, approximately or exactly? 21/33
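
A compact Python sketch of the push procedure above, for a graph given as an adjacency list (the queue-based scheduling is an implementation choice; parameter names follow the slide):

```python
def acl_push(adj, deg, seed, beta=0.85, tau=1e-4, rho=0.5):
    """Sketch of the ACL push procedure for personalized PageRank.
    adj: dict node -> list of neighbors; deg: dict node -> degree;
    seed: the seed node i.  Returns the (sparsely supported) vector x as a dict.
    Only nodes whose residual exceeds tau * degree are ever touched."""
    x = {}
    r = {seed: 1.0 - beta}          # residual starts as (1 - beta) e_i
    queue = [seed]
    while queue:
        j = queue.pop()
        rj, dj = r.get(j, 0.0), deg[j]
        if rj <= tau * dj:          # stopping rule: nothing left to push here
            continue
        amount = rj - tau * dj * rho
        x[j] = x.get(j, 0.0) + amount            # step 3: move mass onto x_j
        r[j] = tau * dj * rho                    # step 4 (i = j): leave a small residual
        for i in adj[j]:                         # step 4 (i ~ j): spread beta * amount / d_j
            r[i] = r.get(i, 0.0) + beta * amount / dj
            if r[i] > tau * deg[i]:
                queue.append(i)
    return x
```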

ACL theory, TCS style. Informally, here is the ACL algorithm: diffuse from a localized seed set of nodes; maintain two localized vectors such that a global invariant is satisfied; stop according to a stopping rule. Informally, here is the ACL worst-case theory: if there is a good-conductance cluster near the initial seed set, then the algorithm will find it, and otherwise it will stop; the output satisfies Cheeger-like quality-of-approximation guarantees; the running time of the algorithm depends on the size of the output but is independent of the size of the graph. Note: This is an approximation algorithm. Question: What does this algorithm compute exactly? 22/33

Constructing graphs that algorithms implicitly optimize. Given G = (V, E), add extra nodes s and t, connected with weighted edges to the nodes in S ⊆ V or to S̄, respectively. Then the s,t-minimum-cut problem is:

min_{x_s = 1, x_t = 0} ||B x||_{C,1} = Σ_{(u,v) ∈ E} C(u,v) |x_u - x_v|.

The ℓ2-minorant of this problem is:

min_{x_s = 1, x_t = 0} ||B x||_{C,2} = sqrt( Σ_{(u,v) ∈ E} C(u,v) (x_u - x_v)^2 ),

or, equivalently, of this problem:

min_{x_s = 1, x_t = 0} (1/2) ||B x||_{C,2}^2 = (1/2) Σ_{(u,v) ∈ E} C(u,v) (x_u - x_v)^2 = (1/2) x^T L x.

23/33
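
A tiny numeric check (mine, not from the talk) of the last identity: build the weighted incidence matrix B and the diagonal edge-weight matrix C for a small graph and verify that (1/2)||Bx||_{C,2}^2 = (1/2) x^T L x.

```python
import numpy as np

def incidence_and_laplacian(edges, weights, n):
    """Weighted incidence matrix B (one row per edge) and Laplacian L = B' C B."""
    m = len(edges)
    B = np.zeros((m, n))
    for e, (u, v) in enumerate(edges):
        B[e, u], B[e, v] = 1.0, -1.0
    C = np.diag(weights)
    L = B.T @ C @ B
    return B, C, L

# Check the identity 0.5 * ||B x||_{C,2}^2 = 0.5 * x' L x on a random x.
edges = [(0, 1), (1, 2), (2, 0), (2, 3)]
weights = np.array([1.0, 2.0, 1.0, 3.0])
B, C, L = incidence_and_laplacian(edges, weights, n=4)
x = np.random.randn(4)
lhs = 0.5 * (B @ x) @ C @ (B @ x)
rhs = 0.5 * x @ L @ x
assert np.isclose(lhs, rhs)
```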

Implicit regularization in the ACL approximation algorithm. Let B(S) be the incidence matrix of the localized cut graph for vertex subset S, and let C(α) be its edge-weight matrix.

Theorem. The PageRank vector z that solves (αD + L) z = α v, with v = d_S / vol(S), is a renormalized solution of the 2-norm cut computation:

min_{x_s = 1, x_t = 0} ||B(S) x||_{C(α),2}.

Theorem. Let x be the output of the ACL push procedure, with parameters set appropriately, and let z_G be the solution of:

min_{z_s = 1, z_t = 0, z ≥ 0} (1/2) ||B(S) z||_{C(α),2}^2 + κ ||D z||_1,

where z = (1, z_G, 0)^T. Then x = D z_G / vol(S). 24/33

Outline
Motivation 1: Social and information networks
Motivation 2: Machine learning data graphs
Local spectral methods in worst-case theory
Local spectral methods to robustify graph construction

Problems that arise with explicit graph construction. Question: What is the effect of noise, perturbations, or arbitrary decisions in the construction of the graph on the output of the subsequent graph algorithm? Common problems with constructing graphs: problems with edges/nonedges in explicit graphs (where arbitrary decisions are hidden from the user); problems with edges/nonedges in constructed graphs (where arbitrary decisions are made by the user); problems with labels associated with the nodes or edges (since the ground truth may be unreliable). 26/33

Semi-supervised learning and implicit graph construction.
Figure: The s,t-cut graphs associated with four different constructions for semi-supervised learning on a graph: (a) Zhou et al., (b) Andersen-Lang, (c) Joachims, (d) ZGL. The labeled nodes are indicated by the blue and red colors; this construction is to predict the blue class.
To do semi-supervised learning, these methods propose a diffusion equation: Y = (L + αD)^{-1} S. This equation propagates labels from a small set S of labeled nodes. It is equivalent to the minorant of an s,t-cut problem, where the problem varies based on the class. This is exactly the MOV local spectral formulation, which ACL approximately solves and exactly solves in a regularized version. 27/33
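
A minimal sketch of the shared diffusion step Y = (L + αD)^{-1} S, assuming a dense Laplacian and one-hot labels; the per-method differences in how S and the cut graph are built are not shown, and the function and variable names are mine.

```python
import numpy as np

def ssl_diffusion(L, d, labels, alpha=0.1):
    """labels: dict node -> class id for the labeled nodes only.
    Returns a predicted class for every node via Y = (L + alpha*D)^{-1} S."""
    n = L.shape[0]
    classes = sorted(set(labels.values()))
    S = np.zeros((n, len(classes)))
    for node, c in labels.items():
        S[node, classes.index(c)] = 1.0             # one-hot label indicator
    Y = np.linalg.solve(L + alpha * np.diag(d), S)  # propagate labels over the graph
    return [classes[k] for k in Y.argmax(axis=1)]   # pick the strongest class per node
```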

Comparing diffusions: sparse and dense graphs, low and high label-error rates. Figure: number of mistakes vs. number of labels for Joachims, Zhou, and ZGL on (a) a dense graph with low label error; (b) a sparse graph with low label error; (c) a dense graph with high label error; (d) a sparse graph with high label error. 28/33

Case study with a toy digits data set. Figure: error rate of Zhou and Zhou+Push while (a) varying σ and (b) varying the number of nearest neighbors. Performance of the diffusions while varying the density by changing (a) σ or (b) r in the nearest-neighbor construction. In both cases, making the graph denser results in worse performance. 29/33

Densifying sparse graphs with matrix polynomials.
Do it the usual way: vary the kernel-density width parameter σ; convert the weighted graph into a highly sparse unweighted graph through a nearest-neighbor construction.
Do it by counting paths of different lengths: run the A_k construction. Given a graph with adjacency matrix A, the graph A_k counts the number of paths of length up to k between pairs of nodes:

A_k = Σ_{l=1}^{k} A^l.

That is, over-sparsify, compute A_k for k = 2, 3, 4, and then do the nearest-neighbor construction.
(BTW, this is essentially what local spectral methods do.) 30/33
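
A sketch of the A_k densification step with SciPy sparse matrices; the top-weight nearest-neighbor re-sparsification rule is my reading of the slide, not code from the talk.

```python
import numpy as np
from scipy import sparse

def densify_Ak(A, k=3, num_neighbors=10):
    """Given a sparse 0/1 adjacency matrix A (CSR), form A_k = sum_{l=1..k} A^l
    (counting paths of length up to k), then keep only each node's
    num_neighbors heaviest edges, as in a nearest-neighbor construction."""
    Ak = A.astype(float).copy()
    P = A.astype(float).copy()
    for _ in range(k - 1):
        P = P @ A                    # paths of length l+1
        Ak = Ak + P
    Ak = Ak.tolil()
    Ak.setdiag(0)                    # drop self-paths
    Ak = Ak.tocsr()
    Ak.eliminate_zeros()
    rows, cols, vals = [], [], []
    for i in range(Ak.shape[0]):
        start, end = Ak.indptr[i], Ak.indptr[i + 1]
        idx, w = Ak.indices[start:end], Ak.data[start:end]
        keep = np.argsort(-w)[:num_neighbors]    # heaviest num_neighbors edges of node i
        rows.extend([i] * len(keep))
        cols.extend(idx[keep])
        vals.extend(w[keep])
    W = sparse.csr_matrix((vals, (rows, cols)), shape=Ak.shape)
    return W.maximum(W.T)            # symmetrize
```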

Error rates on densified sparse graphs.

Table: Paired sets of parameters that give the same number of non-zeros in a nearest-neighbor graph and a densified nearest-neighbor graph A_k.

  Nearest-neighbor graph          Densified graph A_k
  Neighs.   Avg. Deg.             Neighs.   k   Avg. Deg.
  13        19.0                  3         2   18.1
  28        40.5                  5         2   39.2
  37        53.3                  3         3   52.3
  73        104.4                 10        2   103.8
  97        138.2                 3         4   127.1

Table: Median error rates show the benefit of densifying a sparse graph with the A_k construction. Using an average degree of 138 outperforms all of the nearest-neighbor trials from the previous figure.

              Zhou               Zhou w. Push
  Avg. Deg.   k = 1    k > 1     k = 1    k > 1
  19          0.163    0.114     0.156    0.117
  41          0.156    0.132     0.158    0.113
  53          0.183    0.142     0.179    0.136
  104         0.193    0.145     0.178    0.144
  138         0.216    0.102     0.204    0.101

31/33

Pictorial illustration. Figure legend: avoidable errors, unavoidable errors, correct. Panels: (a) Zhou on the sparse graph; (b) Zhou on the dense graph; (c) Zhou+Push on the sparse graph; (d) Zhou+Push on the dense graph. We artificially densify this graph to A_k (for k = 2, 3, 4, 5) to compare sparse and dense diffusions and implicit ACL regularization. (Unavoidable errors are caused by a mislabeled node.) Things to note: on dense graphs, regularizing diffusions has a smaller effect (b vs. d); on sparse graphs, regularizing diffusions has a bigger effect (a vs. c); regularized diffusions are less sensitive to density changes than unregularized diffusions (c vs. d). 32/33

Conclusion.
Many real data graphs are very "not nice": data-analysis design decisions are often reasonable but somewhat arbitrary; noise from label errors, node/edge errors, and arbitrary decisions hurts diffusion-based algorithms in different ways.
Many preprocessing design decisions make data more "nice": good when algorithm users do it (it helps algorithms return something meaningful); bad when algorithm developers do it (algorithms don't get stress-tested on not-nice data).
Local and locally-biased spectral algorithms: have very nice algorithmic and statistical properties; can also be used to robustify the graph-construction step against the arbitrariness of data-preprocessing decisions.
33/33