A spectral clustering algorithm based on Gram operators


1 A spectral clustering algorithm based on Gram operators. Ilaria Giulini, Département de Mathématiques et Applications, ENS, Paris. Joint work with Olivier Catoni. 1 July 2015


3 Clustering: the task of grouping objects into classes (clusters) according to their similarities. Spectral clustering algorithms use data-dependent matrices to build the clusters.

4 Similarity graph. Assume a notion of similarity given by a symmetric affinity matrix $A = (a_{ij})$, where $a_{ij} \geq 0$ measures the similarity between $X_i$ and $X_j$. Represent the data points in a similarity graph $G = (V, E)$, where $V = \{X_1, \dots, X_n\}$ is the set of vertices, $E \subseteq V \times V$ is the set of edges, and the edge between $X_i$ and $X_j$ is weighted by $a_{ij}$.

5 Graph partitioning. Goal: find a partition of the graph such that edges between different groups have low weight (dissimilar points) and edges within a group have high weight (similar points).

6 Graph (bi-)partitioning: cut. Find the partition that minimizes $\mathrm{cut}(S, S^c) = \sum_{i \in S,\, j \in S^c} a_{ij}$. There are efficient algorithms to solve this problem [M. Stoer, F. Wagner]. Problem: it tends to separate a single vertex from the rest of the graph.

7 Graph partitioning: Ncut [Shi and Malik]. Find the partition that minimizes $$\mathrm{Ncut}(S, S^c) = \left( \frac{1}{\mathrm{vol}(S)} + \frac{1}{\mathrm{vol}(S^c)} \right) \mathrm{cut}(S, S^c),$$ where $\mathrm{vol}(S) = \sum_{i \in S,\, j \in V} a_{ij}$ and $\mathrm{cut}(S, S^c) = \sum_{i \in S,\, j \in S^c} a_{ij}$. Problem: this minimization is NP-hard. Spectral clustering is a way to solve a relaxation of Ncut.
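
As a concrete reference point, here is a small Python sketch that evaluates $\mathrm{Ncut}$ for a given bipartition directly from the definitions above (the function name is ours, not from the slides):

```python
import numpy as np

def ncut(A, S):
    """Normalized cut of the bipartition (S, S^c) of a weighted graph.

    A: symmetric, nonnegative affinity matrix.
    S: boolean mask selecting one side of the partition.
    """
    Sc = ~S
    cut = A[np.ix_(S, Sc)].sum()   # sum of edge weights crossing the cut
    vol_S = A[S, :].sum()          # vol(S)   = sum_{i in S,   j in V} a_ij
    vol_Sc = A[Sc, :].sum()        # vol(S^c) = sum_{i in S^c, j in V} a_ij
    return (1.0 / vol_S + 1.0 / vol_Sc) * cut
```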

8 Graph partitioning: Ncut. Minimizing Ncut is equivalent to $$\min_{S} \; v^\top (D - A)\, v \quad \text{subject to} \quad Dv \perp \mathbf{1}, \;\; v^\top D v = \mathrm{vol}(V),$$ where $v \in \mathbb{R}^n$ is defined by $$v_i = \sqrt{\frac{\mathrm{vol}(S^c)}{\mathrm{vol}(S)}}\, \mathbf{1}[i \in S] - \sqrt{\frac{\mathrm{vol}(S)}{\mathrm{vol}(S^c)}}\, \mathbf{1}[i \in S^c],$$ $A = (a_{ij})$ is the affinity matrix, $D = \mathrm{diag}(d_1, \dots, d_n)$ with $d_i = \sum_{j \in V} a_{ij}$, $\mathrm{vol}(V) = \sum_{i \in V} d_i$, and $\mathbf{1} = (1, \dots, 1)$.

9 Relaxation of Ncut: $$\min_{v \in \mathbb{R}^n} \; v^\top (D - A)\, v \quad \text{subject to} \quad Dv \perp \mathbf{1}, \;\; v^\top D v = \mathrm{vol}(V).$$ Define the Laplacian matrix $L = D^{-1/2} A D^{-1/2}$. With the change of variable $u = D^{1/2} v$, the relaxation of Ncut is equivalent to $$\min_{u \in \mathbb{R}^n} \; u^\top (I - L)\, u \quad \text{subject to} \quad u \perp D^{1/2} \mathbf{1}, \;\; \|u\|^2 = \mathrm{vol}(V),$$ where $(I - L)\, D^{1/2} \mathbf{1} = 0$. Solution: $u$ is the eigenvector of $I - L$ associated with the second smallest eigenvalue.
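
A minimal sketch of the relaxed solution, assuming a connected graph with positive degrees (helper name is illustrative):

```python
import numpy as np
from scipy.linalg import eigh

def ncut_relaxation(A):
    """Relaxed Ncut: 2nd smallest eigenvector of I - L, with
    L = D^{-1/2} A D^{-1/2}, mapped back through v = D^{-1/2} u."""
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    vals, vecs = eigh(np.eye(len(A)) - L)  # eigenvalues in increasing order
    u = vecs[:, 1]                         # 2nd smallest eigenvector
    return d_inv_sqrt * u                  # sign of v_i gives the bipartition
```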

10 $c$-way partitioning. Compute the $c$ smallest eigenvectors of $I - L$, i.e. the $c$ largest eigenvectors of $L$.

11 The Ng, Jordan, Weiss algorithm. Let $X_1, \dots, X_n$ be the points to cluster and $c$ the number of classes.
1. Form the affinity matrix $a_{ij} = \exp(-\|X_i - X_j\|^2 / 2\sigma^2)$ if $i \neq j$, and $a_{ii} = 0$.
2. Construct $L = D^{-1/2} A D^{-1/2}$, where $D_{ii} = \sum_j a_{ij}$.
3. Compute the $c$ largest eigenvectors $v_1, \dots, v_c$ of $L$ and form $T = [v_1 \cdots v_c] \in \mathbb{R}^{n \times c}$.
4. Renormalize the rows of $T$: $Y_{ij} = T_{ij} / \big( \sum_j T_{ij}^2 \big)^{1/2}$.
5. Treat each row of $Y$ as a vector in $\mathbb{R}^c$.
6. Cluster the points according to this new representation, as in the sketch below.
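
Step 6 is typically carried out with k-means on the rows of $Y$, as in the original paper. Below is a minimal Python sketch of the whole algorithm; the function name and the use of scipy's kmeans2 for the final step are illustrative choices, not part of the slides:

```python
import numpy as np
from scipy.linalg import eigh
from scipy.cluster.vq import kmeans2

def njw_spectral_clustering(X, c, sigma=1.0):
    # Step 1: Gaussian affinity matrix with zero diagonal.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # Step 2: L = D^{-1/2} A D^{-1/2} with D_ii = sum_j a_ij.
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    # Step 3: the c largest eigenvectors of L (eigh sorts increasingly).
    _, vecs = eigh(L)
    T = vecs[:, -c:]
    # Step 4: renormalize each row of T to unit length.
    Y = T / np.linalg.norm(T, axis=1, keepdims=True)
    # Steps 5-6: treat rows of Y as points in R^c and run k-means.
    _, labels = kmeans2(Y, c, minit='++')
    return labels
```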

12 Continuous counterpart. Let $K(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$. Consider $L = D^{-1/2} A D^{-1/2}$ as the empirical version of the integral operator with kernel $$\widetilde{K}(x, y) = \frac{K(x, y)}{\left( \int K(x, z)\, dP(z) \right)^{1/2} \left( \int K(y, z)\, dP(z) \right)^{1/2}}$$ (where $P$ is unknown). References: 1. [U. von Luxburg, M. Belkin, O. Bousquet]; 2. [L. Rosasco, M. Belkin, E. De Vito].

13 Gram operators. Recall $$\widetilde{K}(x, y) = \frac{K(x, y)}{\left( \int K(x, z)\, dP(z) \right)^{1/2} \left( \int K(y, z)\, dP(z) \right)^{1/2}}.$$ By the Moore-Aronszajn theorem, $\widetilde{K}(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$. Define the Gram operator on $\mathcal{H}$: $$G v = \int \langle v, \phi(z) \rangle_{\mathcal{H}}\, \phi(z)\, dP(z).$$

14 Ideal algorithm. Let $K(x, y) = \exp(-\beta \|x - y\|^2)$.
1. Form $\widetilde{K}(x, y) = \langle \phi(x), \phi(y) \rangle_{\mathcal{H}}$.
2. Construct $\widetilde{K}_m(x, y) = \langle G^{\frac{m-1}{2}} \phi(x),\, G^{\frac{m-1}{2}} \phi(y) \rangle_{\mathcal{H}}$.
3. Renormalize to obtain $\overline{K}_m(x, y) = \widetilde{K}_m(x, x)^{-1/2}\, \widetilde{K}_m(x, y)\, \widetilde{K}_m(y, y)^{-1/2}$.
4. Cluster the points according to this new representation.
Goal: construct an empirical version of this algorithm.
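
On a finite sample one can naively replace $G$ by the empirical Gram operator (i.e. replace $dP$ by the empirical measure), which acts on $\mathrm{span}\{\phi(X_i)\}$ as the matrix $\widetilde{K}/n$; since $G$ is self-adjoint, $\langle G^{\frac{m-1}{2}}\phi(X_i), G^{\frac{m-1}{2}}\phi(X_j)\rangle = \langle G^{m-1}\phi(X_i), \phi(X_j)\rangle$ then becomes $\big((\widetilde{K}/n)^{m-1}\widetilde{K}\big)_{ij}$. The sketch below implements this plug-in version, using the empirical mean as estimator of $\mu$; the point of the talk is precisely that a more robust estimator of $G$ is preferable.

```python
import numpy as np

def iterated_kernel_matrix(X, m, beta=1.0):
    # K(x, y) = exp(-beta ||x - y||^2)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-beta * sq)
    # Step 1: K_hat(x, y) = mu(x)^{-1/2} K(x, y) mu(y)^{-1/2},
    # with mu estimated by the empirical mean of K(x, X_l).
    mu = K.mean(axis=1)
    K_hat = K / np.sqrt(np.outer(mu, mu))
    # Step 2: iterate the empirical Gram operator:
    # <G^{m-1} phi(X_i), phi(X_j)> becomes ((K_hat / n)^{m-1} K_hat)_ij.
    K_m = np.linalg.matrix_power(K_hat / len(X), m - 1) @ K_hat
    # Step 3: renormalize K_m(x,x)^{-1/2} K_m(x,y) K_m(y,y)^{-1/2}.
    s = np.sqrt(np.diag(K_m))
    return K_m / np.outer(s, s)
```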

15 Step 1. $$\widetilde{K}(x, y) = \frac{K(x, y)}{\left( \int K(x, z)\, dP(z) \right)^{1/2} \left( \int K(y, z)\, dP(z) \right)^{1/2}}.$$ Let $\hat\mu(x)$ be any estimator of $\mu(x) = \int K(x, z)\, dP(z)$; an estimator of $\widetilde{K}(x, y)$ is $$\hat{K}(x, y) = \hat\mu(x)^{-1/2}\, K(x, y)\, \hat\mu(y)^{-1/2}.$$
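
For instance, taking for $\hat\mu$ the empirical mean (one possible choice of estimator; the talk develops robust alternatives), the normalized kernel can be evaluated at arbitrary points as in this sketch:

```python
import numpy as np

def k_hat(x, y, X, K):
    """Plug-in estimator of the normalized kernel at arbitrary points,
    with mu-hat taken to be the empirical mean (an assumed choice)."""
    mu_x = np.mean([K(x, z) for z in X])  # mu-hat(x) = (1/n) sum_l K(x, X_l)
    mu_y = np.mean([K(y, z) for z in X])
    return K(x, y) / np.sqrt(mu_x * mu_y)
```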

16 By the Moore-Aronszajn theorem, $\hat{K}(x, y) = \langle \hat\phi(x), \hat\phi(y) \rangle_{\hat{\mathcal{H}}}$. Replacing $\widetilde{K}(x, y)$ with $\hat{K}(x, y)$ gives $$\hat{K}_m(x, y) = \langle \hat{G}^{\frac{m-1}{2}} \hat\phi(x),\, \hat{G}^{\frac{m-1}{2}} \hat\phi(y) \rangle_{\hat{\mathcal{H}}}, \quad \text{where} \quad \hat{G} v = \int \langle v, \hat\phi(z) \rangle_{\hat{\mathcal{H}}}\, \hat\phi(z)\, dP(z)$$ (still unknown!). We therefore need to estimate the Gram operator.

17 Estimation of the Gram operator. Goal: estimate $G\theta = \int \langle \theta, v \rangle_{\mathcal{H}}\, v\, dP(v)$, $\theta \in \mathcal{H}$, from $X_1, \dots, X_n$ i.i.d. $\sim P$. Related problem: estimate the quadratic form $\langle G\theta, \theta \rangle_{\mathcal{H}} = \int \langle \theta, v \rangle_{\mathcal{H}}^2\, dP(v)$. Idea: Step 1, work in finite dimension with non-asymptotic, dimension-free bounds; Step 2, generalize to infinite dimension.

18 Finite-dimensional case. Define the Gram matrix $G = \mathbb{E}[X X^\top]$, where $X \in \mathbb{R}^d$, $X \sim P$. Goal: estimate the quadratic form $\theta^\top G \theta = \mathbb{E}[\langle \theta, X \rangle^2]$, $\theta \in \mathbb{R}^d$, from $X_1, \dots, X_n$ i.i.d. $\sim P$. The classical empirical estimator $$\frac{1}{n} \sum_{i=1}^n \langle \theta, X_i \rangle^2 \;\longrightarrow\; \mathbb{E}[\langle \theta, X \rangle^2] \quad \text{(law of large numbers)}$$ is consistent but sensitive to the tail of the distribution, which motivates a robust estimator.

19 To reduce the influence of the tail of the distribution of $\langle \theta, X \rangle^2$, introduce $$r_\lambda(\theta) = \frac{1}{n} \sum_{i=1}^n \psi\big( \langle \theta, X_i \rangle^2 - \lambda \big), \quad \lambda > 0,$$ where $\psi$ is an influence function satisfying $$-\log\left(1 - t + \frac{t^2}{2}\right) \leq \psi(t) \leq \log\left(1 + t + \frac{t^2}{2}\right), \quad t \in \mathbb{R}.$$

20 Truncated version of the empirical estimator. Introduce $$r_\lambda(\theta) = \frac{1}{n} \sum_{i=1}^n \psi\big( \langle \theta, X_i \rangle^2 - \lambda \big), \qquad \hat\alpha_\theta = \sup\{ \alpha \in \mathbb{R}_+ : r_\lambda(\alpha \theta) \leq 0 \},$$ so that $r_\lambda(\hat\alpha_\theta\, \theta) = 0$. The resulting estimator of the quadratic form is linked to $\lambda / \hat\alpha_\theta^2$, as in the sketch below.
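
A minimal numerical sketch of this truncated estimator, under the reading of $r_\lambda$ reconstructed above and with $\psi(t) = \mathrm{sign}(t)\log(1 + |t| + t^2/2)$, a standard choice satisfying the two logarithmic bounds; function names are illustrative:

```python
import numpy as np
from scipy.optimize import brentq

def psi(t):
    # sign(t) * log(1 + |t| + t^2/2) lies between the two bounds above.
    return np.sign(t) * np.log1p(np.abs(t) + 0.5 * t ** 2)

def robust_quadratic_form(X, theta, lam):
    s = (X @ theta) ** 2                          # samples of <theta, X_i>^2
    r = lambda a: np.mean(psi(a ** 2 * s - lam))  # r_lambda(alpha * theta)
    hi = 1.0
    while r(hi) <= 0:      # r is increasing in alpha and r(0) = psi(-lam) < 0,
        hi *= 2.0          # so double until the zero crossing is bracketed
    alpha_hat = brentq(r, 0.0, hi)                # solve r(alpha_hat) = 0
    return lam / alpha_hat ** 2                   # estimator lambda / alpha_hat^2
```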

21 Use a PAC-Bayesian approach to construct a confidence region: with probability $1 - 2\epsilon$, for all $\theta \in S_d$, $$B_-\big( \lambda / \hat\alpha_\theta^2 \big) \leq \theta^\top G \theta \leq B_+\big( \lambda / \hat\alpha_\theta^2 \big).$$ Optimal confidence region: $B_-(\theta) \leq \theta^\top G \theta \leq B_+(\theta)$. Define as an estimator of $G$ $$\hat{G} = \arg\min \left\{ \|H\|_F : H = H^\top, \; B_-(\theta) \leq \theta^\top H \theta \leq B_+(\theta), \; \theta \in \Theta_\delta \right\},$$ with $\Theta_\delta$ any finite $\delta$-net of $S_d$.

22 Proposition. Notation: $N(\theta) = \theta^\top G \theta$. Let $\kappa = \sup_\theta \mathbb{E}[\langle \theta, X \rangle^4] / \mathbb{E}[\langle \theta, X \rangle^2]^2 < +\infty$. With probability $1 - 2\epsilon$, for all $\theta \in S_d$, $$\big| N(\theta) - \theta^\top \hat{G}\, \theta \big| \leq 2 \max\{N(\theta), \sigma\}\, \mu\big(N(\theta)\big) \Big( 1 + 4 \mu\big(N(\theta)\big) \Big) + 7 \delta \Big( \sqrt{\mathrm{tr}(G^2)} + \sigma \Big),$$ where, for $n$ large enough, $$\mu\big(N(\theta)\big) = \sqrt{ \frac{2.023\,(\kappa - 1)}{n} \left( \frac{a\, \mathrm{tr}(G)}{\max\{N(\theta), \sigma\}} + b \right) } + \frac{c\, \kappa \big( \mathrm{tr}(G) + \log(\epsilon^{-1}) \big)}{n \max\{N(\theta), \sigma\}}.$$ This extends to any Hilbert space, assuming $\mathrm{tr}(G) < +\infty$.

23 Empirical results. A sample in $\mathbb{R}^{10}$ of size $n = 100$ drawn from a Gaussian mixture distribution. Left: projection onto the first two coordinates. Right: projection onto the 2nd and 3rd coordinates.

24 Empirical results (approximation errors). 500 empirical errors sorted in increasing order. Figure: $\|\hat{G} - G\|_F^2$ and $\|\bar{G} - G\|_F^2$.

25 Infinite-dimensional case. Let $(\mathcal{H}_k)_k$ be an increasing sequence of subspaces of $\mathcal{H}$ with $\dim(\mathcal{H}_k) < +\infty$ and $\overline{\bigcup_k \mathcal{H}_k} = \mathcal{H}$. By a continuity argument, with probability $1 - 2\epsilon$, for all $\theta \in S_{\mathcal{H}}$, $B_-(\theta) \leq \langle G\theta, \theta \rangle_{\mathcal{H}} \leq B_+(\theta)$. Notation: $V_k = \mathrm{span}\{\Pi_k X_1, \dots, \Pi_k X_n\}$, $\dim(V_k) < +\infty$. Take $\hat{G}_k : V_k \to V_k$ such that $\mathrm{tr}(\hat{G}_k^2) \leq \mathrm{tr}(G^2)$ and $B_-(\theta) \leq \langle \hat{G}_k \theta, \theta \rangle_{\mathcal{H}} \leq B_+(\theta)$ for $\theta \in \Theta_\delta \subset S_{\mathcal{H}} \cap V_k$. Define $Q = \hat{G}_k\, \Pi_{V_k}$.

26 Proposition. Notation: $N(\theta) = \langle G\theta, \theta \rangle_{\mathcal{H}}$. Let $\kappa = \sup_\theta \mathbb{E}[\langle \theta, X \rangle_{\mathcal{H}}^4] / \mathbb{E}[\langle \theta, X \rangle_{\mathcal{H}}^2]^2 < +\infty$. With probability at least $1 - 2\epsilon$, for all $\theta \in S_{\mathcal{H}}$, $$\big| N(\theta) - \langle Q\theta, \theta \rangle_{\mathcal{H}} \big| \leq 2 \max\{N(\theta), \sigma\}\, \mu\big(N(\theta)\big) \Big( 1 + 4 \mu\big(N(\theta)\big) \Big) + 7 \delta \Big( \sqrt{\mathrm{tr}(G^2)} + \sigma \Big) + v_k,$$ where $v_k \to 0$ as $k \to +\infty$ and $\mu(N(\theta))$ is as in the finite-dimensional proposition (slide 22).

27 Final representation: $$\overline{K}_m(x, y) = \widetilde{K}_m(x, x)^{-1/2}\, \widetilde{K}_m(x, y)\, \widetilde{K}_m(y, y)^{-1/2}.$$ In this way the smallest eigenvalues are killed, which yields a natural dimensionality reduction and an automatic estimate of the number of classes. Moreover, the clusters are sent to the vertices of a simplex.
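
One possible post-processing sketch (the thresholding rule and function names are our own illustration, not from the slides): read the number of classes off the eigenvalue spectrum of the renormalized matrix, then cluster the resulting near-simplex embedding.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def cluster_final_representation(K_bar, rel_gap=0.5):
    vals, vecs = np.linalg.eigh(K_bar)          # eigenvalues in increasing order
    c = int(np.sum(vals > rel_gap * vals[-1]))  # hypothetical eigengap rule
    embedding = vecs[:, -c:]                    # points concentrate near the
    _, labels = kmeans2(embedding, c, minit='++')  # vertices of a simplex
    return c, labels
```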

28-33 Empirical results (figures only; the first experiment uses a sample of size $n = 900$).

34 Bibliography
I. Giulini, Generalization bounds for random samples in Hilbert spaces, PhD thesis.
O. Catoni, Estimating the Gram matrix through PAC-Bayes bounds, preprint.
O. Catoni, Challenging the empirical mean and empirical variance: a deviation study, Ann. Inst. H. Poincaré Probab. Statist. 48(4) (2012).
A. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm, Advances in Neural Information Processing Systems (2001).
L. Rosasco, M. Belkin, E. De Vito, On learning with integral operators, J. Mach. Learn. Res. (2010).
J. Shi, J. Malik, Normalized cuts and image segmentation, IEEE Trans. Pattern Anal. Mach. Intell. 22(8) (2000).
M. Stoer, F. Wagner, A simple min-cut algorithm, J. ACM (1997).
U. von Luxburg, M. Belkin, O. Bousquet, Consistency of spectral clustering, Ann. Statist. (2008).
