SVD, discrepancy, and regular structure of contingency tables


1 SVD, discrepancy, and regular structure of contingency tables
Marianna Bolla, Institute of Mathematics, Technical University of Budapest
(Research supported by the TÁMOP B-10/ project)
European Meeting of Statisticians, Budapest, July 21, 2013

2 Outline
Motivation: to find a relatively small number of groups of the objects belonging to the rows and columns of a contingency table that exhibit homogeneous behavior with respect to each other and do not differ significantly in size; to make inferences on the separation that can be achieved for a given number of clusters, minimum normalized bicuts are investigated and related to the SVD of the correspondence matrix.
Topics: singular value decomposition (SVD) of a correspondence matrix; SVD and normalized bicuts of the contingency table; volume-regular row-column cluster pairs; application and a possible extension to directed graphs.

3 References
We partly extend the result of Butler, S., Using discrepancy to control singular values for nonnegative matrices, Lin. Alg. Appl. (2006), for estimating the discrepancy of a contingency table by the second largest singular value of the normalized table (one-cluster, rectangular case), and the result of Bolla, M., Modularity spectra, eigen-subspaces, and structure of weighted graphs, European Journal of Combinatorics (2013), for estimating the constant of volume-regularity by the structural eigenvalues and the distances of the corresponding eigen-subspaces of the normalized modularity matrix of an edge-weighted graph.

4 SVD of contingency tables and correspondence matrices
C = (c_{ij}): n \times m contingency table, c_{ij} \ge 0.
Row set: Row = {1, ..., n}; column set: Col = {1, ..., m}.

    d_{row,i} = \sum_{j=1}^m c_{ij} (i = 1, ..., n),    d_{col,j} = \sum_{i=1}^n c_{ij} (j = 1, ..., m),

    D_{row} = diag(d_{row,1}, ..., d_{row,n}),    D_{col} = diag(d_{col,1}, ..., d_{col,m}).

5 Representation
For a given integer 1 \le k \le \min\{n, m\}, we are looking for k-dimensional representatives r_1, ..., r_n of the rows and c_1, ..., c_m of the columns such that they minimize the objective function

    Q_k = \sum_{i=1}^n \sum_{j=1}^m c_{ij} \| r_i - c_j \|^2    (1)

subject to

    \sum_{i=1}^n d_{row,i} r_i r_i^T = I_k,    \sum_{j=1}^m d_{col,j} c_j c_j^T = I_k.    (2)

When minimized, the objective function Q_k favors k-dimensional placement of the rows and columns such that representatives of highly associated rows and columns are forced to be close to each other. As we will see, this is equivalent to the problem of correspondence analysis.
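
A hedged numpy sketch of how the objective (1) can be evaluated for candidate representatives (the function name is an illustrative choice, not from the talk; X and Y hold the r_i and c_j as rows):

    import numpy as np

    def Q_k(C, X, Y):
        # objective (1): sum_{i,j} c_{ij} ||r_i - c_j||^2
        C = np.asarray(C, dtype=float)
        diff2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)   # ||r_i - c_j||^2
        return (C * diff2).sum()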

6 Solution
X := (x_1, ..., x_k) = (r_1^T, ..., r_n^T)^T: n \times k, Y := (y_1, ..., y_k) = (c_1^T, ..., c_m^T)^T: m \times k. Then

    Q_k = 2k - 2 tr[(D_{row}^{1/2} X)^T C_{corr} (D_{col}^{1/2} Y)] \to \min

subject to X^T D_{row} X = I_k, Y^T D_{col} Y = I_k, where C_{corr} = D_{row}^{-1/2} C D_{col}^{-1/2} is the correspondence matrix (normalized contingency table) belonging to the table C.
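
A minimal numpy sketch of this normalization (the function name and the dense-array representation are illustrative choices, not from the talk):

    import numpy as np

    def correspondence_matrix(C):
        # C: nonnegative n x m contingency table with no identically zero row or column
        C = np.asarray(C, dtype=float)
        d_row = C.sum(axis=1)                          # d_{row,i} from slide 4
        d_col = C.sum(axis=0)                          # d_{col,j} from slide 4
        C_corr = C / np.sqrt(np.outer(d_row, d_col))   # D_{row}^{-1/2} C D_{col}^{-1/2}
        return C_corr, d_row, d_col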

7 Representation theorem
Let C_{corr} = \sum_{i=1}^r s_i v_i u_i^T be the SVD, where r \le \min\{n, m\} is the rank of C_{corr}, or equivalently (since there are no identically zero rows or columns) the rank of C, and 1 = s_1 \ge s_2 \ge ... \ge s_r > 0. Here v_1 = (\sqrt{d_{row,1}}, ..., \sqrt{d_{row,n}})^T and u_1 = (\sqrt{d_{col,1}}, ..., \sqrt{d_{col,m}})^T. Let k \le r be a positive integer such that s_k > s_{k+1}. Then

    \min Q_k = 2k - 2 \sum_{i=1}^k s_i,

and it is attained with X = D_{row}^{-1/2} (v_1, ..., v_k) and Y = D_{col}^{-1/2} (u_1, ..., u_k).
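
A hedged sketch of this construction, reusing correspondence_matrix from the previous sketch (the function name is illustrative; ties s_k = s_{k+1} are not handled):

    import numpy as np

    def optimal_representatives(C, k):
        # X = D_{row}^{-1/2}(v_1,...,v_k), Y = D_{col}^{-1/2}(u_1,...,u_k)
        C_corr, d_row, d_col = correspondence_matrix(C)
        V, s, Ut = np.linalg.svd(C_corr, full_matrices=False)   # C_corr = V diag(s) Ut
        X = V[:, :k] / np.sqrt(d_row)[:, None]                  # rows of X are the r_i
        Y = Ut[:k, :].T / np.sqrt(d_col)[:, None]               # rows of Y are the c_j
        return X, Y, s

On such output, s[0] equals 1 up to rounding, the first columns of X and Y are constant (the trivial pair v_1, u_1 above), and Q_k(C, X, Y) from the earlier sketch agrees, up to rounding, with 2k - 2 \sum_{i=1}^k s_i.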

8 Normalized bicuts of contingency tables
Given an integer k (0 < k \le r), we want to simultaneously partition the rows and columns into disjoint, nonempty subsets Row = R_1 \cup ... \cup R_k, Col = C_1 \cup ... \cup C_k such that the cuts c(R_a, C_b) = \sum_{i \in R_a} \sum_{j \in C_b} c_{ij} (a, b = 1, ..., k) between the row-column cluster pairs are as homogeneous as possible. The normalized two-way cut of C, given the above k-partitions P_{row} = (R_1, ..., R_k) and P_{col} = (C_1, ..., C_k) and the collection of signs σ, is

    ν_k(P_{row}, P_{col}, σ) = \sum_{a=1}^k \sum_{b=1}^k ( 1/Vol(R_a) + 1/Vol(C_b) + 2 σ_{ab} δ_{ab} / \sqrt{Vol(R_a) Vol(C_b)} ) c(R_a, C_b),

9 where Vol(R_a) = \sum_{i \in R_a} d_{row,i}, Vol(C_b) = \sum_{j \in C_b} d_{col,j}, δ_{ab} is the Kronecker delta, σ_{ab} = ±1, and σ = (σ_{11}, ..., σ_{kk}). For the normalized bicut of C:

    \min_{P_{row}, P_{col}, σ} ν_k(P_{row}, P_{col}, σ) \ge 2k - 2 \sum_{i=1}^k s_i.
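
For concreteness, a small sketch (the names and the 0..k-1 label encoding are mine) that evaluates ν_k for given partitions and signs, under the sign convention of the formula as reconstructed above:

    import numpy as np

    def normalized_bicut(C, row_labels, col_labels, sigma):
        # row_labels (length n), col_labels (length m): cluster indices in 0..k-1, all clusters nonempty;
        # sigma: length-k vector of diagonal signs +1/-1
        C = np.asarray(C, dtype=float)
        row_labels, col_labels = np.asarray(row_labels), np.asarray(col_labels)
        d_row, d_col = C.sum(axis=1), C.sum(axis=0)
        nu = 0.0
        for a in range(len(sigma)):
            for b in range(len(sigma)):
                cut = C[row_labels == a][:, col_labels == b].sum()   # c(R_a, C_b)
                vol_R = d_row[row_labels == a].sum()                 # Vol(R_a)
                vol_C = d_col[col_labels == b].sum()                 # Vol(C_b)
                coeff = 1.0 / vol_R + 1.0 / vol_C
                if a == b:
                    coeff += 2.0 * sigma[a] / np.sqrt(vol_R * vol_C)
                nu += coeff * cut
        return nu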

10 Special case: edge-weighted graph
W: symmetric n \times n edge-weight matrix. D = D_{row} = D_{col}; W_{corr} = D^{-1/2} W D^{-1/2}: normalized modularity matrix, I - W_{corr}: normalized Laplacian. When the k - 1 largest absolute value eigenvalues of W_{corr} are all positive: r_i = c_i (i = 1, ..., n = m). With the choice σ_{bb} = -1 (b = 1, ..., k), ν_k(P_{row}, P_{col}, σ) is twice the normalized cut of the graph with respect to the k-partition P_{row} = P_{col} of the vertices. The normalized bicut favors k-partitions with low inter-cluster edge densities: community structure. When the k - 1 largest absolute value eigenvalues of the normalized modularity matrix are all negative: r_i = -c_i, and the minimum of the normalized bicut is attained with the choice σ_{bb} = 1 (b = 1, ..., k). The normalized bicut favors k-partitions with low intra-cluster edge densities: anticommunity structure.

11 Regular row-column cluster pairs
The Expander Mixing Lemma for edge-weighted graphs naturally extends to this situation (Butler): for all R \subset Row and C \subset Col,

    | c(R, C) - Vol(R) Vol(C) | \le s_2 \sqrt{Vol(R) Vol(C)},

where s_2 is the largest-but-one singular value of C_{corr}. Since the spectral gap of C_{corr} is 1 - s_2, in view of the above Expander Mixing Lemma a large spectral gap is an indication of small discrepancy: the weighted cut between any row and column subset of the contingency table is near to what is expected in a random table.
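
A quick numerical sanity check of this bound on a random table (the sizes, the random model, and the subset sampling are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    C = rng.random((40, 60))
    C = C / C.sum()                                     # total volume 1
    d_row, d_col = C.sum(axis=1), C.sum(axis=0)
    s = np.linalg.svd(C / np.sqrt(np.outer(d_row, d_col)), compute_uv=False)
    s2 = s[1]                                           # largest-but-one singular value

    worst = 0.0
    for _ in range(2000):                               # random row and column subsets
        R = rng.random(40) < 0.5
        Y = rng.random(60) < 0.5
        if R.any() and Y.any():
            vol_R, vol_Y = d_row[R].sum(), d_col[Y].sum()
            dev = abs(C[R][:, Y].sum() - vol_R * vol_Y) / np.sqrt(vol_R * vol_Y)
            worst = max(worst, dev)
    print(worst, "<=", s2)                              # the observed discrepancy stays below the spectral bound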

12 Volume-regular cluster pairs
We extend the notion of discrepancy to volume-regular pairs.
Definition. The row-column cluster pair R \subset Row, C \subset Col of the contingency table C of total volume 1 is γ-volume regular if for all X \subset R and Y \subset C the relation

    | c(X, Y) - ρ(R, C) Vol(X) Vol(Y) | \le γ \sqrt{Vol(R) Vol(C)}

holds, where ρ(R, C) = c(R, C) / (Vol(R) Vol(C)) is the relative inter-cluster density of the row-column pair R, C.
We will show that for a given k, if the clusters are formed by applying the weighted k-means algorithm to the optimal row and column representatives, respectively, then the row-column cluster pairs so obtained are homogeneous in the sense that they form equally dense parts of the contingency table.

13 Weighted k-variance
The weighted k-variance of the k-dimensional row representatives is defined by

    S_k^2(X) = \min_{(R_1,...,R_k)} \sum_{a=1}^k \sum_{j \in R_a} d_{row,j} \| r_j - \bar{r}_a \|^2,

where \bar{r}_a = (1/Vol(R_a)) \sum_{j \in R_a} d_{row,j} r_j is the weighted center of cluster R_a (a = 1, ..., k). Similarly, the weighted k-variance of the k-dimensional column representatives is

    S_k^2(Y) = \min_{(C_1,...,C_k)} \sum_{a=1}^k \sum_{j \in C_a} d_{col,j} \| c_j - \bar{c}_a \|^2,

where \bar{c}_a = (1/Vol(C_a)) \sum_{j \in C_a} d_{col,j} c_j is the weighted center of cluster C_a (a = 1, ..., k). Observe that the trivial vector components can be omitted, and the k-variance of the so obtained (k-1)-dimensional representatives will be the same.
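
Below is a hedged sketch of a Lloyd-type weighted k-means that approximately minimizes this weighted k-variance (the initialization and stopping rule are deliberately naive; reps holds the representatives as rows, weights the corresponding d_{row,j} or d_{col,j}):

    import numpy as np

    def weighted_kmeans(reps, weights, k, n_iter=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = reps[rng.choice(len(reps), size=k, replace=False)].copy()
        for _ in range(n_iter):
            # assignment step: nearest center (the weights do not affect which center is nearest)
            labels = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1).argmin(axis=1)
            # update step: weighted cluster centers, as in the definition of the weighted k-variance
            for a in range(k):
                mask = labels == a
                if mask.any():
                    centers[a] = np.average(reps[mask], axis=0, weights=weights[mask])
        S2 = sum(weights[labels == a] @ ((reps[labels == a] - centers[a]) ** 2).sum(axis=1)
                 for a in range(k))
        return labels, S2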

14 Theorem
Theorem. Let C be a contingency table with n-element row set Row and m-element column set Col, with row and column sums d_{row,1}, ..., d_{row,n} and d_{col,1}, ..., d_{col,m}, respectively. Suppose that \sum_{i=1}^n \sum_{j=1}^m c_{ij} = 1 and there are no dominant rows and columns: d_{row,i} = Θ(1/n) (i = 1, ..., n) and d_{col,j} = Θ(1/m) (j = 1, ..., m) as n, m \to \infty. Let the singular values of C_{corr} be

    1 = s_1 > s_2 \ge ... \ge s_k > ε \ge s_i,    i \ge k + 1.

The partitions (R_1, ..., R_k) of Row and (C_1, ..., C_k) of Col are defined so that they minimize the weighted k-variances S_k^2(X) and S_k^2(Y) of the row and column representatives. Suppose that there are constants 0 < K_1, K_2 \le 1/k such that |R_i| \ge K_1 n and |C_i| \ge K_2 m (i = 1, ..., k), respectively. Then the R_i, C_j pairs are O(\sqrt{2k} (S_k(X) + S_k(Y)) + ε)-volume regular (i, j = 1, ..., k).
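
To connect the theorem with the preceding slides, here is a hedged sketch that computes the quantities entering the bound, reusing optimal_representatives and weighted_kmeans from the earlier sketches (C and k are assumed to be given; the exact form of the O(.) term follows the reconstruction above):

    # C: contingency table (numpy array) with total sum 1; k: number of clusters
    X, Y, s = optimal_representatives(C, k)
    d_row, d_col = C.sum(axis=1), C.sum(axis=0)
    row_labels, SX2 = weighted_kmeans(X[:, 1:], d_row, k)   # trivial first component dropped
    col_labels, SY2 = weighted_kmeans(Y[:, 1:], d_col, k)
    eps = s[k] if len(s) > k else 0.0                        # bound on the tail singular values
    # the theorem then bounds the volume-regularity constant of every (R_i, C_j) pair
    # by a term of order sqrt(2k) * (sqrt(SX2) + sqrt(SY2)) + eps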

15 Application
We applied the biclustering algorithm to find clusters of stores and products simultaneously, based on their consumption in TESCO stores. We found 3 clusters of the stores in which the consumption of the products belonging to the same cluster was homogeneous, with consumption density c(R_a, C_b) / (Vol(R_a) Vol(C_b)) between store cluster R_a and product cluster C_b (a, b = 1, ..., 3). After sorting the rows and columns according to their cluster memberships, we plotted the entries c_{ij} / (d_{row,i} d_{col,j}). (There was one exceptional store cluster which contained only 3 stores, but the others could be identified with groups of smaller and larger stores associated with product groups of high consumption density.)

16 Directed graphs
W: n \times n (not symmetric) weight matrix of a directed graph, where w_{ij} is the weight of the i \to j edge (i, j = 1, ..., n; i \ne j), and w_{ii} = 0 (i = 1, ..., n). The generalized in- and out-degrees:

    d_{out,i} = \sum_{j=1}^n w_{ij} (i = 1, ..., n),    d_{in,j} = \sum_{i=1}^n w_{ij} (j = 1, ..., n).

D_{in} = diag(d_{in,1}, ..., d_{in,n}) and D_{out} = diag(d_{out,1}, ..., d_{out,n}) are the in- and out-degree matrices. Suppose that there are no sources and sinks (i.e., no zero out- or in-degrees). W_{corr} = D_{out}^{-1/2} W D_{in}^{-1/2}, and its SVD is used to minimize the normalized bicut of W as a contingency table.

17 Regular in- and out-vertex cluster pairs
The V_{in}, V_{out} in- and out-vertex cluster pair of the directed graph (with the sum of the weights of the directed edges equal to 1) is γ-volume regular if for all X \subset V_{out} and Y \subset V_{in} the relation

    | w(X, Y) - ρ(V_{out}, V_{in}) Vol_{out}(X) Vol_{in}(Y) | \le γ \sqrt{Vol_{out}(V_{out}) Vol_{in}(V_{in})}

holds, where the directed cut w(X, Y) is the sum of the weights of the X \to Y edges, Vol_{out}(X) = \sum_{i \in X} d_{out,i}, Vol_{in}(Y) = \sum_{j \in Y} d_{in,j}, and ρ(V_{out}, V_{in}) = w(V_{out}, V_{in}) / (Vol_{out}(V_{out}) Vol_{in}(V_{in})) is the relative inter-cluster density of the out-in cluster pair V_{out}, V_{in}. The clusterings (V_{in,1}, ..., V_{in,k}) and (V_{out,1}, ..., V_{out,k}) of the columns and rows guaranteed by the above theorem correspond to in- and out-clusters of the same vertex set such that the directed information flow V_{out,a} \to V_{in,b} is as homogeneous as possible for all pairs a, b = 1, ..., k. Example: emigration-immigration patterns of countries.
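
A hedged sketch of how the in- and out-clusters could be obtained in practice, treating W as a contingency table and reusing the functions from the earlier sketches (W and k are assumed to be given; no sources or sinks, as above):

    # W: n x n nonnegative weight matrix of the directed graph, zero diagonal
    X, Y, s = optimal_representatives(W, k)              # row side: out-, column side: in-representatives
    d_out, d_in = W.sum(axis=1), W.sum(axis=0)
    out_labels, _ = weighted_kmeans(X[:, 1:], d_out, k)  # out-clusters V_{out,1}, ..., V_{out,k}
    in_labels, _ = weighted_kmeans(Y[:, 1:], d_in, k)    # in-clusters V_{in,1}, ..., V_{in,k}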

18 Thank you for your attention
