Fitting a Graph to Vector Data. Samuel I. Daitch (Yale) Jonathan A. Kelner (MIT) Daniel A. Spielman (Yale)
1 Fitting a Graph to Vector Data Samuel I. Daitch (Yale) Jonathan A. Kelner (MIT) Daniel A. Spielman (Yale) Machine Learning Summer School, June 2009
2 Given a collection of vectors $x_1, \dots, x_n \in \mathbb{R}^d$, find a natural weighted graph on $V = \{1, \dots, n\}$ that is: 1. theoretically well motivated, 2. helps solve ML problems on the vectors, 3. has interesting combinatorial properties, 4. efficiently computable, 5. sparse.
3 Given a collection of vectors $x_1, \dots, x_n \in \mathbb{R}^d$, find a natural weighted graph on $V = \{1, \dots, n\}$ that is: 1. theoretically well motivated, 2. helps solve ML problems on the vectors (and TCS problems on the vectors), 3. has interesting combinatorial properties, 4. efficiently computable, 5. sparse.
4 Outline: standard graphs; motivation for ours; Laplacian view; combinatorial properties (sparsity and planarity); use in classification, regression and clustering; efficient computation; approximate sparsity; Lovász's graphs; open questions.
5 Standard ways to make graphs. Choose one from each column. Choice of edges: $k$ nearest neighbors, or neighbors if distance is less than a threshold $\delta$. Weights of edges: unweighted, or $\exp(-\|x_i - x_j\|^2 / 2\sigma^2)$.
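As a concrete illustration of these standard constructions (my own sketch, not code from the talk; the choices of k and sigma are placeholders), here is a symmetrized k-nearest-neighbor graph with Gaussian edge weights built with numpy:

```python
import numpy as np

def knn_gaussian_graph(X, k=5, sigma=1.0):
    """Symmetrized k-NN graph with weights exp(-||x_i - x_j||^2 / (2 sigma^2)).

    X : (n, d) array of data points.
    Returns an (n, n) symmetric weighted adjacency matrix.
    """
    n = X.shape[0]
    sq = np.sum(X**2, axis=1)
    D2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    D2 = np.maximum(D2, 0.0)                         # guard against tiny negatives
    np.fill_diagonal(D2, np.inf)                     # no self-loops
    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(D2[i])[:k]                 # indices of the k nearest neighbors
        W[i, nbrs] = np.exp(-D2[i, nbrs] / (2.0 * sigma**2))
    return np.maximum(W, W.T)                        # keep an edge if either endpoint chose it
```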
6 Our first proposal. Choose weights to minimize $\sum_{i=1}^{n} \big\| d_i x_i - \sum_{j \sim i} w_{i,j} x_j \big\|^2$ with $w_{i,j} = w_{j,i} \ge 0$, where the weighted degrees satisfy $d_i = \sum_{j \sim i} w_{i,j} \ge 1$.
7 Our first proposal (with an example graph in the figure; edge weights such as 0.67 are shown). Choose weights to minimize $\sum_{i=1}^{n} \big\| d_i x_i - \sum_{j \sim i} w_{i,j} x_j \big\|^2$ with $w_{i,j} = w_{j,i}$, where the weighted degrees satisfy $d_i = \sum_{j \sim i} w_{i,j} \ge 1$.
8 Motivation. Consider the regression problem of solving for $y_i$ given $\{y_j : j \ne i\}$, where $y_i = f(x_i)$. Given a weighted graph, a natural guess is $\frac{1}{d_i} \sum_{j \sim i} w_{i,j} y_j$. The sum of squares of errors when we leave out each $i$ is $\sum_{i=1}^{n} \big( y_i - \frac{1}{d_i} \sum_{j \sim i} w_{i,j} y_j \big)^2$.
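A small sketch of this leave-one-out score (my own illustration, assuming a dense weight matrix): each $y_i$ is predicted by the degree-normalized weighted average of its neighbors' labels.

```python
import numpy as np

def leave_one_out_error(W, y):
    """Sum of squared leave-one-out errors: sum_i (y_i - (1/d_i) sum_j w_ij y_j)^2.

    W : (n, n) symmetric weighted adjacency matrix (zero diagonal).
    y : (n,) vector of labels / regression targets.
    """
    d = W.sum(axis=1)          # weighted degrees d_i
    pred = (W @ y) / d         # neighbor-averaged prediction for each i
    return np.sum((y - pred) ** 2)
```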
9 Motivation. Try to get this right on the coordinate vectors: write $x_i = (x_i^1, \dots, x_i^d)$. The sum of squares of errors over the $d$ coordinates is $\sum_{k=1}^{d} \sum_{i=1}^{n} \big( x_i^k - \frac{1}{d_i} \sum_{j \sim i} w_{i,j} x_j^k \big)^2 = \sum_{i=1}^{n} \big\| x_i - \frac{1}{d_i} \sum_{j \sim i} w_{i,j} x_j \big\|^2$.
10 Motivation (continued). Choose edge weights to minimize this quantity.
11 Motivation (continued). Choose edge weights to minimize this quantity. But this leads to a non-convex problem, and, as the next example shows, to degenerate solutions.
12-15 Motivation, tricky example. Consider minimizing $\sum_{i=1}^{n} \big\| x_i - \frac{1}{d_i} \sum_{j \sim i} w_{i,j} x_j \big\|^2$ (the example point set is built up in the accompanying figures).
16 Motivation, tricky example. Consider minimizing $\sum_{i=1}^{n} \big\| x_i - \frac{1}{d_i} \sum_{j \sim i} w_{i,j} x_j \big\|^2$. The figure shows a graph whose edge weights are either $1$ or $\varepsilon$: the objective just gets better as $\varepsilon$ goes to $0$.
17 Motivation, fixing the bad example. Scale the $i$-th error by the weighted degree $d_i$: instead of $\sum_{i=1}^{n} \big\| x_i - \frac{1}{d_i} \sum_{j \sim i} w_{i,j} x_j \big\|^2$, sum $\sum_{i=1}^{n} \big\| d_i x_i - \sum_{j \sim i} w_{i,j} x_j \big\|^2$.
18 Motivation, fixing the bad example. Sum of squares of errors, scaled by the weighted degrees: $\sum_{i=1}^{n} \big\| d_i x_i - \sum_{j \sim i} w_{i,j} x_j \big\|^2$. To avoid degeneracy, require either $d_i = \sum_{j \sim i} w_{i,j} \ge 1$ for every $i$ (a hard graph), or $\sum_{i : d_i < 1} (1 - d_i)^2 \le \alpha n$ (an $\alpha$-soft graph).
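The following sketch (my own illustration, assuming dense numpy arrays) evaluates the degree-weighted objective and the two non-degeneracy conditions for a candidate weight matrix:

```python
import numpy as np

def fitting_objective(W, X):
    """Degree-weighted objective  sum_i || d_i x_i - sum_j w_ij x_j ||^2.

    W : (n, n) symmetric non-negative weight matrix with zero diagonal.
    X : (n, d) matrix whose rows are the data vectors.
    """
    d = W.sum(axis=1)                       # weighted degrees
    residual = d[:, None] * X - W @ X       # row i: d_i x_i - sum_j w_ij x_j
    return np.sum(residual ** 2)

def is_hard_graph(W, tol=1e-9):
    """Hard constraint: every weighted degree is at least 1."""
    return np.all(W.sum(axis=1) >= 1.0 - tol)

def is_alpha_soft_graph(W, alpha=0.1):
    """alpha-soft constraint: sum over deficient vertices of (1 - d_i)^2 is at most alpha * n."""
    d = W.sum(axis=1)
    deficiency = np.maximum(1.0 - d, 0.0)
    return np.sum(deficiency ** 2) <= alpha * len(d)
```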
19 Example: random points in 2 dimensions (0.1-soft graph).
20 Example: two clusters (0.1-soft graph).
21 Example (0.1-soft graph).
22 Unique? Not necessarily: not unique on highly symmetric point sets (we will see why later). But almost always unique. Conjecture: unique with probability 1 after an infinitesimally small random perturbation.
23 Related to: Locally linear embeddings [Roweis-Saul], which choose allowable neighbors and then choose asymmetric weights to minimize the same objective function; and k-means: if we restrict the graph to be a union of complete graphs, one for each cluster, we are minimizing the same objective function.
24 In terms of the graph Laplacian: $L = D - A$, where $A$ is the weighted adjacency matrix and $D$ is the diagonal matrix of weighted degrees.
25 In terms of the graph Laplacian: $L = D - A$, where $A$ is the weighted adjacency matrix and $D$ is the diagonal matrix of weighted degrees. Then $\sum_{i=1}^{n} \big\| d_i x_i - \sum_{j \sim i} w_{i,j} x_j \big\|^2 = \| L X \|_F^2$. Since $L$ is linear in the edge weights, $\| L X \|_F^2$ is quadratic in them. (The objective we rejected was $\| D^{-1} L X \|_F^2$.)
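As a quick numerical check (my own sketch, not from the talk), the Laplacian form of the objective can be compared against the direct sum-of-residuals form:

```python
import numpy as np

rng = np.random.default_rng(0)
n, dim = 6, 2
X = rng.normal(size=(n, dim))

# random symmetric non-negative weights with zero diagonal
W = rng.uniform(size=(n, n))
W = np.triu(W, 1)
W = W + W.T

D = np.diag(W.sum(axis=1))      # diagonal matrix of weighted degrees
L = D - W                       # graph Laplacian

direct = np.sum((np.diag(D)[:, None] * X - W @ X) ** 2)
laplacian_form = np.linalg.norm(L @ X, 'fro') ** 2
assert np.isclose(direct, laplacian_form)
```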
26 Sparse solution. We can fix $LX$, i.e. the difference vectors $\Delta_i = d_i x_i - \sum_{j \sim i} w_{i,j} x_j = \sum_{j \sim i} w_{i,j} (x_i - x_j)$, and the weighted degrees $d = |U|\, w$ (with $U$ the signed vertex-edge incidence matrix defined below). This imposes only $n(d+1)$ constraints on $w$. If $w$ has more than $n(d+1)$ non-zero entries, we can make one of them zero while holding all of these quantities fixed and keeping $w$ non-negative.
27 Sparsity. Theorem: for $n$ vectors in $d$ dimensions, there is always such a graph with at most $(d+1)n$ edges, i.e. average degree $2(d+1)$. Theorem (later): for $n$ vectors in any dimension, there is an $\varepsilon$-approximate solution with average degree $O(1/\varepsilon^2)$.
28-29 Degrees on real data. Table over data sets (abalone: regression; glass: 6 classes; heart: 2 classes; housing: regression; ionosphere: 2 classes; iris: 3 classes; machine: regression; mpg: regression; pima: 2 classes; sonar: 2 classes; vehicle: 4 classes; vowel: 11 classes; wine: 3 classes) with columns n, dim, average degree (hard), and average degree (soft); the numeric entries were not transcribed.
30 Example for which graphs are not unique. Consider the set of vectors in $\{0,1\}^d$ of even parity. The sum of any two solutions is a solution, so sum over all isomorphisms of the point set: every vertex then gets degree at least $\binom{d}{2}$. By the previous theorem, there is a solution of average degree at most $2(d+1)$. So we cannot have both symmetry and sparsity.
31 Planarity. For every set of vectors in 2 dimensions, there is a hard and an $\alpha$-soft graph that is planar: no crossings.
32 Planarity. For every set of vectors in 2 dimensions, there is a hard and an $\alpha$-soft graph that is planar: no crossings, and no point is contained in a triangle. In general dimension, the graph forms a simplicial complex.
33 Planarity, proof. A graph maximizing the sum of weighted degrees has no crossings: decrease the weights on a pair of crossing edges and increase the weights on edges outside the crossing. This can be done so as to fix $LX$, and thus $\|LX\|_F^2$, while increasing all the weighted degrees.
34 Planarity, proof. Suppose edges $(x_0, x_1)$ and $(y_0, y_1)$ cross at a point $z$, with $z = \alpha_0 x_0 + \alpha_1 x_1 = \beta_0 y_0 + \beta_1 y_1$, $\alpha_0 + \alpha_1 = 1$, $\beta_0 + \beta_1 = 1$. Decrease the weight of $(x_0, x_1)$ by $t\,\alpha_0 \alpha_1$ and of $(y_0, y_1)$ by $t\,\beta_0 \beta_1$, and increase the weight of each non-crossing edge $(x_a, y_b)$ by $t\,\alpha_a \beta_b$, as in the figure. This fixes $LX$, and thus $\|LX\|_F^2$, while increasing all the weighted degrees.
35 Classification and regression experiments. Given a set $S$ of vectors with labels $y_i \in \mathbb{R}$, compute $z$ minimizing $\sum_{i \sim j} w_{i,j} (z_i - z_j)^2 = z^T L z$ subject to $z_i = y_i$ for $i \in S$ [ZGL 03]. With 2 classes, use an indicator vector for one class; with $k$ classes, use $k$ indicator vectors.
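The constrained minimization of $z^T L z$ has a closed form: the unlabeled entries satisfy a linear system in the Laplacian. A minimal sketch of this harmonic interpolation (my own illustration, assuming a dense Laplacian and a single indicator vector of labels):

```python
import numpy as np

def harmonic_interpolation(L, labeled_idx, y_labeled):
    """Minimize z^T L z subject to z_i = y_i on the labeled set.

    L           : (n, n) graph Laplacian.
    labeled_idx : indices of labeled vertices.
    y_labeled   : their labels (e.g. an indicator vector of one class).
    Returns the full vector z; threshold it (e.g. at 0.5) for classification.
    """
    n = L.shape[0]
    labeled = np.zeros(n, dtype=bool)
    labeled[labeled_idx] = True
    unlabeled = ~labeled

    # Setting the gradient of z^T L z to zero on the unlabeled block gives
    #   L_UU z_U = -L_US y_S
    L_UU = L[np.ix_(unlabeled, unlabeled)]
    L_US = L[np.ix_(unlabeled, labeled)]
    z = np.zeros(n)
    z[labeled_idx] = y_labeled
    z[unlabeled] = np.linalg.solve(L_UU, -L_US @ np.asarray(y_labeled, dtype=float))
    return z
```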
36 Classification experiments. Used 10-fold cross-validation. We had no parameters to train. For other means of choosing graphs, we chose their parameters by gridding over choices and evaluating by leave-one-out cross-validation. Repeated 100 times.
37 Classification experiments. Table over data sets (glass, heart, ionosphere, iris, pima, sonar, vehicle, vowel, wine) with columns n, dim, number of classes, and results for the 0.1-soft, k-NN, and threshold graphs; the numeric entries were not transcribed. For the k-NN and threshold graphs we tried unweighted and exponential weights (varying $\sigma$), and varied $k$ for k-NN and $\delta$ for the threshold.
38 Classification experiments. The same table with an additional libsvm column (data sets glass, heart, ionosphere, iris, pima, sonar, vehicle, vowel, wine; columns n, dim, classes, 0.1-soft, k-NN, threshold, libsvm; numeric entries not transcribed). For libsvm [Chang-Lin], parameters were chosen by 10-fold cross-validation on the training data.
39 Regression error. Table over data sets abalone, housing, machine, and mpg with columns n, dim and results for the hard, 0.1-soft, k-NN, and threshold graphs and epsilon-SVR (numeric entries not transcribed; the hard graph took too long on abalone). Data normalized to have variance 1; 2-fold cross-validation on abalone, 10-fold otherwise.
40 Clustering experiments (0.1-soft graph). Table over data sets glass, heart, ionosphere, iris, pima, sonar, vehicle, vowel, wine with columns n, dim, k, and results for k-means and for NJW with Nvec = k (numeric entries not transcribed). NJW: Ng-Jordan-Weiss, as implemented in Spider. We use our graph and run k-means in Nvec eigenvectors.
41 Clustering experiments (0.1-soft graph). The same table with an additional column in which Nvec is chosen per data set: glass (12), heart (2; 0.19), ionosphere (15), iris (8), pima (12), sonar (35), vehicle (6), vowel (30), wine (3); other numeric entries not transcribed. NJW: Ng-Jordan-Weiss, as implemented in Spider. We use our graph and run k-means in Nvec eigenvectors.
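A minimal sketch of this clustering step (my own, not the Spider implementation, and using the unnormalized Laplacian): take the Nvec eigenvectors with smallest eigenvalues of the fitted graph's Laplacian and run k-means on the rows, in the spirit of Ng-Jordan-Weiss.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clusters(W, k, n_vec):
    """Cluster the vertices of a weighted graph.

    W     : (n, n) symmetric weighted adjacency matrix.
    k     : number of clusters.
    n_vec : number of Laplacian eigenvectors to use (Nvec above).
    """
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)     # ascending eigenvalues
    embedding = eigvecs[:, :n_vec]           # each vertex becomes a point in R^{n_vec}
    _, labels = kmeans2(embedding, k, minit='++')
    return labels
```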
42 Choosing the number of vectors for the spectral projection. (The figure plots, for each eigenvector ordered by eigenvalue $\lambda$, the size of its coefficient in $X$.) Project $X$ onto the eigenvectors; take vectors until all remaining coefficients are small. (Still fuzzy.)
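A sketch of my reading of this heuristic (the tolerance is an assumption): expand the columns of X in the Laplacian eigenbasis and keep eigenvectors until every later coefficient is small.

```python
import numpy as np

def spectral_coefficients(W, X):
    """Norm of X's coefficients on each Laplacian eigenvector, ordered by eigenvalue."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)      # ascending eigenvalues
    coeffs = eigvecs.T @ X                    # row r = coefficients of X on eigenvector r
    return eigvals, np.linalg.norm(coeffs, axis=1)

def choose_n_vec(W, X, tol=0.05):
    """Heuristic Nvec: smallest prefix after which every coefficient is below tol * max."""
    _, c = spectral_coefficients(W, X)
    big = np.nonzero(c > tol * c.max())[0]
    return big.max() + 1 if big.size else 1
```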
43 Quadratic program for edge weights. $L = U W U^T$, where $U$ is the signed vertex-edge incidence matrix ($|V|$ rows, $|E|$ columns, entries $\pm 1$) and $W$ is the diagonal matrix of edge weights.
44 The objective function is quadratic in the edge weights: $\|LX\|_F^2 = \sum_{k=1}^{d} \|L x^k\|^2 = \sum_{k=1}^{d} \|U W U^T x^k\|^2 = \sum_{k=1}^{d} \|U W y^k\|^2 = \sum_{k=1}^{d} \|U Y^{(k)} w\|^2 = \Big\| \begin{pmatrix} U Y^{(1)} \\ \vdots \\ U Y^{(d)} \end{pmatrix} w \Big\|^2$, where $w$ is the vector of edge weights, $y^k = U^T x^k$, and $Y^{(k)} = \mathrm{diag}(y^k)$.
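A sketch (my own, dense numpy, small graphs only) of how this quadratic form can be assembled: build the signed incidence matrix $U$, form the blocks $U Y^{(k)}$, and stack them into a matrix $M$ with $\|Mw\|^2 = \|LX\|_F^2$.

```python
import numpy as np

def incidence_matrix(edges, n):
    """Signed vertex-edge incidence matrix U: column for edge (i, j) has +1 at i and -1 at j."""
    U = np.zeros((n, len(edges)))
    for e, (i, j) in enumerate(edges):
        U[i, e] = 1.0
        U[j, e] = -1.0
    return U

def objective_matrix(edges, X):
    """Stack the blocks U Y^{(k)}, k = 1..d, so that ||M w||^2 = ||L X||_F^2."""
    n, d = X.shape
    U = incidence_matrix(edges, n)
    Y = U.T @ X                                    # column k is y^k = U^T x^k
    blocks = [U @ np.diag(Y[:, k]) for k in range(d)]
    return np.vstack(blocks), U
```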
45 Soft graph program. Minimize $\|Mw\|^2$ subject to $\sum_{i=1}^{n} \max(1 - d_i, 0)^2 \le \alpha n$ and $w \ge 0$, where $M = \begin{pmatrix} U Y^{(1)} \\ \vdots \\ U Y^{(d)} \end{pmatrix}$ and $d = |U|\, w$.
46 Soft graph program. Minimize $\|Mw\|^2$ subject to $\sum_{i=1}^{n} \max(1 - d_i, 0)^2 \le \alpha n$ and $w \ge 0$, where $M = \begin{pmatrix} U Y^{(1)} \\ \vdots \\ U Y^{(d)} \end{pmatrix}$ and $d = |U|\, w$. In practice, minimize $\|Mw\|^2 + \mu \sum_{i=1}^{n} \max(1 - d_i, 0)^2$, searching over $\mu$ until $\sum_{i=1}^{n} \max(1 - d_i, 0)^2 = \alpha n$.
47 Soft graph program, as a non-negative least squares problem. Minimizing $\|Mw\|^2 + \mu \sum_{i=1}^{n} \max(1 - d_i, 0)^2$ is equivalent to minimizing $\|Mw\|^2 + \mu\, \| \mathbf{1} + s - |U|\, w \|^2$ over $w, s \ge 0$: a non-negative least squares problem.
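A sketch of this NNLS formulation using scipy (my own stacking of the variables; the penalty weight mu and the candidate edge set are assumptions): solve for the concatenated vector [w; s] >= 0 in a single non-negative least squares call.

```python
import numpy as np
from scipy.optimize import nnls

def soft_graph_weights(M, U_abs, mu):
    """Minimize ||M w||^2 + mu * ||1 + s - |U| w||^2 over w, s >= 0 via NNLS.

    M     : stacked (n*d, m) objective matrix with ||M w||^2 = ||L X||_F^2.
    U_abs : (n, m) unsigned incidence matrix, so weighted degrees are d = U_abs @ w.
    mu    : penalty weight on degree deficiencies.
    """
    n, m = U_abs.shape
    root_mu = np.sqrt(mu)
    # variables v = [w (m entries); s (n entries)]; residuals:
    #   top block:    M w
    #   bottom block: sqrt(mu) * (|U| w - s - 1)
    A = np.block([
        [M,               np.zeros((M.shape[0], n))],
        [root_mu * U_abs, -root_mu * np.eye(n)],
    ])
    b = np.concatenate([np.zeros(M.shape[0]), root_mu * np.ones(n)])
    v, _ = nnls(A, b)
    return v[:m]                 # the edge weights w
```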
48 Computing the soft graph quickly. Minimize $\|Mw\|^2 + \mu\, \| \mathbf{1} + s - |U|\, w \|^2$ over $w, s \ge 0$ by looking for a solution among a subset of the candidate edges. Given the solution restricted to the current subset, check the gradient of the objective function with respect to the unused edges: if none is negative, we are finished; otherwise, include some of them and solve again.
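A deliberately naive sketch of that column-generation loop (my own; solve_subset is a hypothetical helper returning the restricted NNLS solution): the gradient of the objective with respect to an unused edge weight at zero tells us whether adding that edge can lower the objective.

```python
import numpy as np

def grow_edge_set(M_full, U_abs_full, mu, solve_subset, start_edges, batch=20):
    """Active-set loop: solve on a subset of edges, then add edges with negative gradient.

    M_full, U_abs_full : objective and unsigned incidence matrices over ALL candidate edges.
    solve_subset       : function(edge_indices) -> (w_subset, s) solving the restricted NNLS.
    """
    active = list(start_edges)
    while True:
        w_sub, s = solve_subset(active)
        w = np.zeros(M_full.shape[1])
        w[active] = w_sub
        # gradient of ||M w||^2 + mu ||1 + s - |U| w||^2 with respect to each edge weight
        grad = 2.0 * M_full.T @ (M_full @ w) \
             - 2.0 * mu * U_abs_full.T @ (1.0 + s - U_abs_full @ w)
        unused = np.setdiff1d(np.arange(M_full.shape[1]), active)
        violating = unused[grad[unused] < 0]
        if violating.size == 0:
            return w             # no unused edge can improve the objective
        active.extend(violating[np.argsort(grad[violating])][:batch])
```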
49 Computation time (seconds). Table over data sets (abalone: regression; glass: 6 classes; heart: 2 classes; housing: regression; ionosphere: 2 classes; iris: 3 classes; machine: regression; mpg: regression; pima: 2 classes; sonar: 2 classes; vehicle: 4 classes; vowel: 11 classes; wine: 3 classes) with columns n, dim, soft-graph time, and hard-graph time; the numeric entries were not transcribed.
50 Approximate sparse solutions. For every graph $G$ there exists a sparse graph $\tilde{G}$ with average degree at most $16/\varepsilon^2$ such that for all $x$, $\big| x^T L^2 x - x^T \tilde{L}^2 x \big| \le \varepsilon\, (x^T L x)\, \|L\|$. If $\|LX\|_F^2$ is not too small, this gives $\|LX\|_F^2 \approx \|\tilde{L} X\|_F^2$, and the leave-one-out regression scores are similar.
51 Approximate sparse solutions. For every graph $G$ there exists a sparse graph $\tilde{G}$ with average degree at most $16/\varepsilon^2$ such that for all $x$, $\big| x^T L^2 x - x^T \tilde{L}^2 x \big| \le \varepsilon\, (x^T L x)\, \|L\|$. What we would want is $\big| x^T L^2 x - x^T \tilde{L}^2 x \big| \le \varepsilon\, x^T L^2 x$.
52 Sparsification theorem [Batson-Spielman-Srivastava 09]. For every graph $G$ there exists a sparse graph $\tilde{G}$ with at most $4n/\varepsilon^2$ edges such that for all $x$, $(1-\varepsilon)\, x^T \tilde{L} x \le x^T L x \le (1+\varepsilon)\, x^T \tilde{L} x$.
53 Sparsification theorem [Batson-Spielman-Srivastava 09]. For every graph $G$ there exists a sparse graph $\tilde{G}$ with at most $4n/\varepsilon^2$ edges such that for all $x$, $(1-\varepsilon)\, x^T \tilde{L} x \le x^T L x \le (1+\varepsilon)\, x^T \tilde{L} x$. Comparison of edges and time: [Spielman-Srivastava 08] gives $O(n \log n/\varepsilon^2)$ edges in roughly $\log n/\varepsilon^2$ linear-system solves; [Spielman-Teng 04] gives $O(n \log^{29} n/\varepsilon^2)$ edges in $O(m \log^{17} n/\varepsilon^2)$ time.
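For intuition, here is a sketch (my own, using a dense pseudoinverse rather than the fast solvers of [Spielman-Srivastava 08]) of sparsification by effective-resistance sampling: sample each edge with probability proportional to its weight times its effective resistance, and reweight the sampled edges.

```python
import numpy as np

def sparsify_by_effective_resistance(W, q, rng=None):
    """Spectral sparsifier sketch: sample q edges proportionally to w_e * R_e and reweight.

    W : (n, n) symmetric weighted adjacency matrix.
    q : number of edge samples (e.g. on the order of n log n / eps^2).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    n = W.shape[0]
    iu, ju = np.triu_indices(n, 1)
    mask = W[iu, ju] > 0
    ii, jj, we = iu[mask], ju[mask], W[iu, ju][mask]

    L = np.diag(W.sum(axis=1)) - W
    Lpinv = np.linalg.pinv(L)                        # slow but fine for a sketch
    # effective resistance of edge (i, j): (e_i - e_j)^T L^+ (e_i - e_j)
    Reff = Lpinv[ii, ii] + Lpinv[jj, jj] - 2.0 * Lpinv[ii, jj]

    p = we * Reff
    p = p / p.sum()
    W_sparse = np.zeros_like(W)
    for e in rng.choice(len(we), size=q, p=p):
        add = we[e] / (q * p[e])                     # each sample adds w_e / (q * p_e)
        W_sparse[ii[e], jj[e]] += add
        W_sparse[jj[e], ii[e]] += add
    return W_sparse
```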
54 Lovász's graphs: Steinitz representations and the Colin de Verdière number. Form a matrix $M = P - A$, where $P$ is positive diagonal and $A$ is non-negative (an adjacency matrix), such that $MX = 0$ (in fact, $\mathrm{null}(M)$ is spanned by the columns of $X$) and $M$ has one negative eigenvalue. Works if the points are on the convex hull of a polytope; $A$ is non-zero only for edges of the polytope.
55 Open Questions Are there other natural and useful graphs? Are there interesting ways to modify/parameterize our construction? Are our graphs connected? Are our graphs unique (under perturbation)? Do there exist sparse approximate solutions in all dimensions? Can one sparsify for the square of the Laplacian? How to interpolate off the graph, or add points?
More information