Overlapping Variable Clustering with Statistical Guarantees and LOVE

Overlapping Variable Clustering with Statistical Guarantees and LOVE
Department of Statistical Science, Cornell University
WHOA-PSI, St. Louis, August 2017

Joint work with Mike Bing, Yang Ning, and Marten Wegkamp (Cornell University, Department of Statistical Science).

Variable clustering
What is variable clustering?
Observable: X = (X_1, ..., X_j, ..., X_p) ∈ R^p, a random vector.
Data: X^(1), ..., X^(n), i.i.d. copies of X ∈ R^p.
Goal of variable clustering: find sub-groups of similar coordinates of X, using the data.
This goal differs from data/point clustering, which finds sub-groups of similar observations X^(i), 1 ≤ i ≤ n.
The data also differ from network clustering, where the data are a 0/1 adjacency matrix.

Co-clustering genes using expression profiles
[Figure: estimated cluster memberships and weights of three genes (ENSG00000272865, ENSG00000273487, ENSG00000273423) across clusters 1-10.]

Model-based overlapping variable clustering
Objectives of model-based (overlapping) variable clustering:
Define a model-based similarity between the coordinates of X.
The model definition depends crucially on what we want to cluster and on the type of data we have: here we cluster variables, and we observe their values.
Use an identifiable model to define clusters of coordinates, allowing for overlap.
Estimate the clusters and assess their accuracy theoretically, within the model-based framework.

A first step towards a model for overlapping clustering
A sparse latent variable model with unstructured sparsity:
1. X = AZ + E; A is a p × K allocation matrix.
2. Z ∈ R^K is a latent vector, E ∈ R^p is a noise vector, and Z is independent of E.
3. A is row sparse: Σ_{k=1}^K |A_jk| ≤ 1 for each j ∈ {1, ..., p}.
Variable similarity and clusters: X_j and X_l are similar if they are connected with the same Z_k.
This suggests a definition of clusters with overlap: G_k := { j ∈ {1, ..., p} : A_jk ≠ 0 }.
Issue: the model and the clusters are not identifiable: AZ = AQQ^T Z for any orthogonal Q, and A_jk may be 0 while (AQ)_jk may not be.

Identifiable models for overlapping clustering
X = AZ + E; X ∈ R^p, Z ∈ R^K (latent), A ∈ R^{p×K}.
A sparse latent variable model with structured sparsity: the pure variable assumption
(i) A is row sparse: Σ_{k=1}^K |A_jk| ≤ 1 for each j ∈ {1, ..., p}.
(ii) For every column k ∈ {1, ..., K}, there exist at least two indices (rows) j ∈ {1, ..., p} such that A_jk = 1 and A_jl = 0 for all l ≠ k.
Spoiler alert! This A is identifiable up to signed permutations.
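
To make the model concrete, here is a minimal simulation sketch in Python/NumPy of data drawn from X = AZ + E under the pure variable assumption. The dimensions, the diagonal C, and the noise level are illustrative choices (loosely mirroring the worked example used later in the talk), not values prescribed by the method.

import numpy as np

rng = np.random.default_rng(0)
p, K, n = 9, 3, 500

# Allocation matrix satisfying the pure variable assumption:
# variables 1-7 are pure (a single entry equal to 1), variables 8-9 are impure.
A = np.zeros((p, K))
A[0:3, 0] = 1.0            # pure variables anchoring cluster G_1
A[3:5, 1] = 1.0            # pure variables anchoring cluster G_2
A[5:7, 2] = 1.0            # pure variables anchoring cluster G_3
A[7] = [0.5, 0.5, 0.0]     # impure: belongs to G_1 and G_2
A[8] = [2/3, 1/6, 1/6]     # impure: belongs to all three clusters

C = np.diag([1.0, 2.0, 3.0])     # Cov(Z); diagonal only for simplicity
Gamma = 0.1 * np.eye(p)          # Cov(E), independent noise

Z = rng.multivariate_normal(np.zeros(K), C, size=n)       # latent factors
E = rng.multivariate_normal(np.zeros(p), Gamma, size=n)   # noise
X = Z @ A.T + E                                           # n x p data matrix, X = A Z + E

Sigma = A @ C @ A.T + Gamma          # population covariance
Sigma_hat = np.cov(X, rowvar=False)  # sample covariance used by the estimators below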

The pure variable assumption: interpretation
Cluster: G_k := { j ∈ {1, ..., p} : A_jk ≠ 0 }.
The pure variable assumption: a pure variable X_j is associated with only one latent factor Z_k.
Pure variables are crucial in building overlapping clusters:
Cluster G_k is given by Z_k, which is not observable; Z_k = (biological) function.
A pure variable X_j is an observable proxy for Z_k: the observable X_j performs function Z_k, and it anchors G_k.

Overlapping clustering: interpretation
An instance of determining unknown functions of variables:
Gene 1 (X_1), with function 1, anchors G_1.
Gene 3 (X_3), with function 2, anchors G_2.
Gene 2 (X_2) ∈ G_1: Gene 2 performs function 1.
Gene 2 (X_2) ∈ G_2: Gene 2 also performs function 2.
Before clustering, Gene 2 had unknown function; after clustering, Gene 2 is found to have a dual function.

Identifiable models for overlapping clustering
Ingredients for identifiability:
A latent variable model X = AZ + E with structure on A: the pure variable assumption.
A mild assumption on C = Cov(Z):
Δ(C) := min_{j ≠ k} ( min{C_jj, C_kk} − C_jk ) > 0.
Δ(C) > 0 implies Z_j ≠ Z_k a.s., for all j ≠ k.
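
As a quick check of the condition, a few lines of Python compute the separation quantity for a hypothetical C (the Δ(C) notation follows the reconstruction of the slide above):

import numpy as np

def Delta(C):
    # Delta(C) = min over j != k of ( min{C_jj, C_kk} - C_jk )
    K = C.shape[0]
    return min(min(C[j, j], C[k, k]) - C[j, k]
               for j in range(K) for k in range(K) if j != k)

C = np.array([[1.0, 0.3],
              [0.3, 2.0]])
print(Delta(C))   # 0.7 > 0: no two coordinates of Z coincide almost surely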

Identifiability in structured sparse latent models
The pure variable set I is identifiable: I and its partition I := {I_k}_{1≤k≤K} can be constructed uniquely from Σ, up to label permutations.
The allocation matrix A is identifiable: under the pure variable assumption, there exists a unique matrix A, up to signed permutations, such that X = AZ + E.
The clusters G = {G_k}_{1≤k≤K} are identifiable: under the pure variable assumption, the overlapping clusters G_k are identifiable, up to label switching.
If pure variables do not exist, identifiability of A fails.

Central challenge in proving identifiability
X = AZ + E implies Σ = ACA^T + Γ.
I = pure variable index set; J = {1, ..., p} \ I = impure variable index set.
Central challenge: how to distinguish between I and J?
Added challenge: how to distinguish between I and J when we do not know the noise covariance Γ?

What is pure and what is impure?
A necessary and sufficient condition for purity
For each 1 ≤ i ≤ p, set
M_i := max_{j ∈ [p]\{i}} Σ_ij,
S_i := { j ∈ [p] \ {i} : Σ_ij = M_i }.
For a given A and its induced pure variable set I, we have
i ∈ I  ⟺  M_i = max_{k ∈ [p]\{j}} Σ_jk for all j ∈ S_i.

Look for maxima in Σ, ignore the diagonal!
I = {{1, 2, 3}, {4, 5}, {6, 7}} and J = {8, 9}.
Σ (diagonal left blank, it is ignored):

    ·    1    1    0    0    0    0  1/2  2/3
    1    ·    1    0    0    0    0  1/2  2/3
    1    1    ·    0    0    0    0  1/2  2/3
    0    0    0    ·    2    0    0    1  1/3
    0    0    0    2    ·    0    0    1  1/3
    0    0    0    0    0    ·    3    0  1/2
    0    0    0    0    0    3    ·    0  1/2
  1/2  1/2  1/2    1    1    0    0    ·  1/6
  2/3  2/3  2/3  1/3  1/3  1/2  1/2  1/6    ·

M_1 = max_{k ≠ 1} Σ_1k = 1.
S_1 = { j ≠ 1 : Σ_1j = 1 } = {2, 3}.
Check: M_1 = max_{k ≠ 2} Σ_2k = 1 and M_1 = max_{k ≠ 3} Σ_3k = 1.
So I_1 = S_1 ∪ {1} = {1, 2, 3} is pure.

Look for maxima in Σ, ignore the diagonal! (same Σ as above)
I = {{1, 2, 3}, {4, 5}, {6, 7}} and J = {8, 9}.
M_8 = max_{k ≠ 8} Σ_8k = 1.
S_8 = { j ≠ 8 : Σ_8j = 1 } = {4, 5}.
But 1 = M_8 ≠ max_{k ≠ 4} Σ_4k = 2.
So 8 cannot be pure: 8 ∈ J.
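
The purity criterion can be verified mechanically on this example. The Python/NumPy sketch below encodes the off-diagonal entries of Σ shown above (the diagonal is ignored by the criterion, so it is set to 0) and applies the characterization; it returns {1, ..., 7} as pure and rejects 8 and 9.

import numpy as np

# Off-diagonal entries of the example Sigma; the diagonal is irrelevant here.
Sigma = np.array([
    [0,   1,   1,   0,   0,   0,   0,   1/2, 2/3],
    [1,   0,   1,   0,   0,   0,   0,   1/2, 2/3],
    [1,   1,   0,   0,   0,   0,   0,   1/2, 2/3],
    [0,   0,   0,   0,   2,   0,   0,   1,   1/3],
    [0,   0,   0,   2,   0,   0,   0,   1,   1/3],
    [0,   0,   0,   0,   0,   0,   3,   0,   1/2],
    [0,   0,   0,   0,   0,   3,   0,   0,   1/2],
    [1/2, 1/2, 1/2, 1,   1,   0,   0,   0,   1/6],
    [2/3, 2/3, 2/3, 1/3, 1/3, 1/2, 1/2, 1/6, 0  ],
])
p = Sigma.shape[0]

def M(i):
    # M_i = max_{j != i} Sigma_ij (off-diagonal row maximum)
    return np.delete(Sigma[i], i).max()

def is_pure(i):
    M_i = M(i)
    S_i = [j for j in range(p) if j != i and Sigma[i, j] == M_i]
    # i is pure  <=>  every j in S_i attains the same off-diagonal maximum M_i
    return all(M(j) == M_i for j in S_i)

print([i + 1 for i in range(p) if is_pure(i)])   # -> [1, 2, 3, 4, 5, 6, 7]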

Estimation
Estimate I.
Estimate A_I, the |I| × K sub-matrix of A with rows in I.
Estimate A_J, with J = {1, ..., p} \ I.
Estimate the clusters G_k.

Estimation of the pure variable set
Reminder: I = pure variable set.
Constructive characterization of I, population version:
For each 1 ≤ i ≤ p, set
M_i := max_{j ∈ [p]\{i}} Σ_ij,
S_i := { j ∈ [p] \ {i} : Σ_ij = M_i }.
For a given A and its induced pure variable set I, we have
i ∈ I  ⟺  M_i = max_{k ∈ [p]\{j}} Σ_jk for all j ∈ S_i.
Moreover, S_i ∪ {i} = I_k for some k, where I := {I_k}_{1≤k≤K} is a partition of I.

Estimation of the pure variable set
Algorithm idea:
Use the constructive characterization of I at the population level.
Replace Σ by the sample covariance Σ̂.
Replace equalities by inequalities, allowing for a tolerance level δ := ‖Σ̂ − Σ‖_∞.
The algorithm has complexity O(p²) and requires the inputs Σ̂ and δ.
The algorithm returns Î, its partition Î = {Î_k}, and therefore K̂.
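
A minimal Python sketch of this idea is below. It is not the authors' algorithm: the exact way the tolerance enters, the handling of signs, and the tie-breaking all differ in the paper; here each population equality is simply relaxed to a 2δ-tolerant inequality, with δ treated as a tuning parameter.

import numpy as np

def estimate_pure_partition(Sigma_hat, delta):
    # Sketch only: relax the population characterization of I by a 2*delta slack.
    p = Sigma_hat.shape[0]
    groups, assigned = [], set()

    def M(i):
        # off-diagonal row maximum of Sigma_hat
        return np.delete(Sigma_hat[i], i).max()

    for i in range(p):
        if i in assigned:
            continue
        M_i = M(i)
        # near-maximizers play the role of S_i
        S_i = [j for j in range(p) if j != i and Sigma_hat[i, j] >= M_i - 2 * delta]
        if all(abs(M(j) - M_i) <= 2 * delta for j in S_i):
            group = [i] + S_i          # plays the role of I_k = S_i union {i}
            groups.append(group)
            assigned.update(group)
    return groups                      # estimated partition; len(groups) estimates K

# Usage on the simulated data from the earlier sketch:
# partition_hat = estimate_pure_partition(Sigma_hat, delta=0.1)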

Estimation of the allocation sub-matrix A_I

A = [ A_I ]
    [ A_J ]

Estimated previously: Î and its partition Î = {Î_1, ..., Î_K̂}.
The estimator Â_Î has rows i ∈ Î consisting of K̂ − 1 zeros and one entry equal to either +1 or −1.
Signs will be determined up to signed permutations.
(1) Pick i ∈ Î_k and pick a sign for Â_ik, say Â_ik = 1.
(2) For any j ∈ Î_k \ {i}, set Â_jk = +1 if Σ̂_ij > 0, and Â_jk = −1 if Σ̂_ij < 0.
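
In code, the two-step sign rule above looks roughly as follows (a sketch; the anchor choice within each group is arbitrary, reflecting that signs are only identified up to a signed permutation):

import numpy as np

def estimate_A_pure(Sigma_hat, partition):
    # partition: estimated pure groups I_hat_1, ..., I_hat_K (lists of indices).
    # Returns {i: row of A_hat} for the pure variables; the first index of each
    # group is taken as the +1 anchor.
    K = len(partition)
    rows = {}
    for k, group in enumerate(partition):
        anchor = group[0]
        rows[anchor] = np.eye(K)[k]                    # A_hat[anchor, k] = +1
        for j in group[1:]:
            row = np.zeros(K)
            row[k] = 1.0 if Sigma_hat[anchor, j] > 0 else -1.0
            rows[j] = row
    return rows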

Estimation of the allocation sub-matrix A_J

Σ = [ Σ_II  Σ_IJ ]   [ A_I C A_I^T   A_I C A_J^T ]   [ Γ_II    0   ]
    [ Σ_JI  Σ_JJ ] = [ A_J C A_I^T   A_J C A_J^T ] + [   0   Γ_JJ  ]

Estimate A_J row by row; motivation:
Σ_IJ = A_I C A_J^T  ⟹  θ_j = C A_j for each j ∈ J,
where θ_j ∈ R^K and C are recovered from Σ via the pure groups:
C_kk := (1 / (|I_k| (|I_k| − 1))) Σ_{i, j ∈ I_k, i ≠ j} Σ_ij,
C_km := (1 / (|I_k| |I_m|)) Σ_{i ∈ I_k, j ∈ I_m} Σ_ij,
θ_j^k := (1 / |I_k|) Σ_{i ∈ I_k} A_ik Σ_ij.
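
Plugging Σ̂ and the estimated pure partition into these identities gives the estimators used in the next step. The sketch below follows the displayed formulas literally; in particular, sign handling beyond the θ̂_j averages is omitted, which suffices when the pure rows of A are non-negative as in the running example.

import numpy as np

def estimate_C_and_theta(Sigma_hat, partition, A_pure_rows):
    # A_pure_rows: dict from the sign step; A_pure_rows[i][k] is +1 or -1 for i in I_hat_k.
    K = len(partition)
    C_hat = np.zeros((K, K))
    for k, Ik in enumerate(partition):
        # C_kk: average of Sigma_hat over distinct pairs inside I_hat_k
        C_hat[k, k] = np.mean([Sigma_hat[i, j] for i in Ik for j in Ik if i != j])
        for m, Im in enumerate(partition):
            if m != k:
                # C_km: average of Sigma_hat over pairs across I_hat_k and I_hat_m
                C_hat[k, m] = np.mean([Sigma_hat[i, j] for i in Ik for j in Im])

    def theta_hat(j):
        # theta_hat_j[k] = (1/|I_k|) * sum_{i in I_k} A_ik * Sigma_hat[i, j]
        return np.array([np.mean([A_pure_rows[i][k] * Sigma_hat[i, j] for i in Ik])
                         for k, Ik in enumerate(partition)])

    return C_hat, theta_hat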

Estimation of the allocation sub-matrix A_J
Estimation of the rows of A_J:
Under the model: θ_j = C A_j, with A_j sparse.
Available: Σ̂ and the estimated partition of pure variables Î.
Use Σ̂ and Î to construct θ̂_j ≈ θ_j and Ĉ ≈ C.
Many choices are available to estimate a sparse A_j. Dantzig-type estimator: minimize ‖β‖_1 over β ∈ R^K̂ such that ‖θ̂_j − Ĉβ‖_∞ ≤ 2δ.
Repeat for each j ∈ Ĵ := {1, ..., p} \ Î to obtain Â_Ĵ.
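
The Dantzig-type program is a linear program once β is split into positive and negative parts. Below is a sketch using scipy.optimize.linprog; the sup-norm constraint and the 2δ radius follow the reconstruction above.

import numpy as np
from scipy.optimize import linprog

def dantzig_row(theta_hat_j, C_hat, delta):
    # minimize ||beta||_1  subject to  ||theta_hat_j - C_hat @ beta||_inf <= 2*delta,
    # written as an LP in (beta_plus, beta_minus) >= 0 with beta = beta_plus - beta_minus.
    K = C_hat.shape[1]
    cost = np.ones(2 * K)
    A_ub = np.vstack([
        np.hstack([ C_hat, -C_hat]),   #  C beta <=  theta + 2*delta
        np.hstack([-C_hat,  C_hat]),   # -C beta <= -theta + 2*delta
    ])
    b_ub = np.concatenate([theta_hat_j + 2 * delta, -theta_hat_j + 2 * delta])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * K))
    beta_plus, beta_minus = res.x[:K], res.x[K:]
    return beta_plus - beta_minus      # estimated row A_hat_j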

Statistical guarantees: assumptions
Recall: Σ = ACA^T + Γ; X sub-Gaussian; ‖Σ̂ − Σ‖_∞ =: δ = O(√((log p)/n)).
Signal strength conditions:
1. Either on C: Δ(C) = min_{j ≠ k} ( min{C_jj, C_kk} − C_jk ) ≥ c δ.
2. Or on A: the smallest non-zero entry of A is larger than δ.

Estimation of the pure variable set: guarantees
I = pure variables; J = {1, ..., p} \ I.
Quasi-pure variables: for each k ∈ [K],
J_1^k = { i ∈ J : A_ik ≥ 1 − 4δ/τ² },   J_1 = ∪_{k=1}^K J_1^k.
If X_1 is pure then A_1k = 1 for some k ∈ [K].
If X_2 is quasi-pure then A_2m ≈ 1 for some m ∈ [K].

Estimation of the pure variable set: guarantees
J_1^k = { i : A_ik ≈ 1 };   J_1 := ∪_{k=1}^K J_1^k;   I_k = { i : A_ik = 1 }.
Recovery guarantees, with no signal strength conditions on A:
(a) K̂ = K.
(b) I ⊆ Î ⊆ I ∪ J_1.
(c) I_k ⊆ Î_k ⊆ I_k ∪ J_1^k,
each holding w.h.p., for each k ∈ [K].
Minimal recovery mistakes, with no conditions on A:
Pure        (1, 0, 0, 0, 0, 0)                    In, correct.
Quasi-pure  (0.99, 0.01, 0, 0, 0, 0)              In, slight mistake.
Impure      (0.25, 0.25, 0.001, 0.099, 0.2, 0.2)  Out, correct.

Estimation of the pure variable set: guarantees
Exact recovery, under conditions on A:
Î = I, up to label switching, with I = ∪_{a=1}^K I_a.
Exact recovery: conditions on A:
Pure        (1, 0, 0, 0, 0, 0)                    In, correct.
Quasi-pure  (0.99, 0.01, 0, 0, 0, 0)              Not allowed.
Impure I    (0.25, 0.25, 0.001, 0.099, 0.2, 0.2)  Not allowed.
Impure II   (0.25, 0.25, 0.1, 0.1, 0.3, 0.2)      Out, correct.

Estimation of the allocation matrix A: guarantees
Sup-norm consistency: let H denote the set of all K × K signed permutation matrices. With probability exceeding 1 − c_1 p^{−c_2},
1. K̂ = K.
2. min_{P ∈ H} ‖Â − PA‖_∞ ≲ κ √((log p)/n), with κ := ‖C^{−1}‖_{∞,1}.
This is a non-standard bound, similar to errors-in-variables model bounds.
If C is diagonally dominant, then κ is constant.

Activation and inhibition
[Slide: a worked numerical example displaying an allocation matrix A with entries of both signs alongside its estimate Â.]
Care is needed in interpreting the signs: for each latent factor Z_k we can consistently determine which of the X_j's are associated with Z_k in the same direction, but not the direction itself.

Estimation of the overlapping groups
Ĝ = { Ĝ_1, ..., Ĝ_K̂ },   Ĝ_k = { i : Â_ik ≠ 0 }.
FPR = ( Σ_{i,k} 1{A_ik = 0, Â_ik ≠ 0} ) / ( Σ_{i,k} 1{A_ik = 0} ),
FNR = ( Σ_{i,k} 1{A_ik ≠ 0, Â_ik = 0} ) / ( Σ_{i,k} 1{A_ik ≠ 0} ).
Guarantees for cluster recovery (all results hold w.h.p.):
Under conditions on C: K̂ = K; FPR = 0; FNR ≤ β.
Under conditions on A: K̂ = K; FPR = 0; FNR = 0; Ĝ = G.
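
These rates are straightforward to compute once the columns of Â have been aligned with those of A (for example, by the best signed permutation); the sketch below assumes the columns are already aligned.

import numpy as np

def cluster_recovery_rates(A_hat, A, tol=0.0):
    # Columns of A_hat are assumed already matched to those of A.
    est = np.abs(A_hat) > tol      # estimated supports: i in G_hat_k iff A_hat[i, k] != 0
    true = A != 0                  # true supports:      i in G_k     iff A[i, k] != 0
    G_hat = [list(np.flatnonzero(est[:, k])) for k in range(A.shape[1])]
    fpr = (est & ~true).sum() / max((~true).sum(), 1)   # false positive rate
    fnr = (~est & true).sum() / max(true.sum(), 1)      # false negative rate
    return G_hat, fpr, fnr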

Sparsity per row: s_j = ‖A_j‖_0 = Σ_k 1{A_jk ≠ 0}.
J_1 = quasi-pure variables.
J_2 = variables associated with some Z's below the noise level.
J_3 = variables associated with all Z's above the noise level.
FNR ≤ β := ( Σ_{j ∈ J_1 ∪ J_2} s_j ) / ( Σ_{j ∈ J_1 ∪ J_2} s_j + Σ_{j ∈ J_3 ∪ I} s_j ).
If |J_3| + |I| >> |J_1| + |J_2|, then β is very small.

LOVE
A Latent model approach to OVErlapping clustering:
Estimate the partition I of pure variables by Î.
Estimate A_I and A_J separately to obtain Â, the allocation matrix estimate.
Estimate the overlapping clusters by Ĝ.
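
Putting the previous sketches together gives a rough end-to-end picture of the pipeline. The helper names (estimate_pure_partition, estimate_A_pure, estimate_C_and_theta, dantzig_row) are this transcript's own illustrations from the earlier slides, not the authors' code, and δ is treated as a tuning parameter.

import numpy as np

def love(X, delta):
    # Sketch of the LOVE pipeline, assembled from the helper sketches above.
    n, p = X.shape
    Sigma_hat = np.cov(X, rowvar=False)

    partition = estimate_pure_partition(Sigma_hat, delta)    # I_hat and its partition
    K_hat = len(partition)                                   # estimated number of clusters
    A_pure = estimate_A_pure(Sigma_hat, partition)           # rows of A_hat for pure variables
    C_hat, theta_hat = estimate_C_and_theta(Sigma_hat, partition, A_pure)

    pure_idx = {i for group in partition for i in group}
    A_hat = np.zeros((p, K_hat))
    for i in pure_idx:
        A_hat[i] = A_pure[i]
    for j in range(p):
        if j not in pure_idx:
            A_hat[j] = dantzig_row(theta_hat(j), C_hat, delta)   # impure rows, Dantzig step

    clusters = [list(np.flatnonzero(A_hat[:, k])) for k in range(K_hat)]   # G_hat_k
    return A_hat, clusters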

Co-clustering genes using expression profiles: p = 500
Benchmark data set: RNA-seq transcript-level data; blood platelet samples from n = 285 individuals.
ENSG00000273487 and ENSG00000272865, both non-coding RNAs, are placed together in Cluster 4; each is also placed in other clusters.
Non-coding RNAs are pleiotropic (they have multiple functions).
[Figure: estimated cluster memberships and weights of ENSG00000272865, ENSG00000273487, and ENSG00000273423 across clusters 1-10.]

Related work
There is a large literature on Non-negative Matrix Factorization (NMF): X = AZ + E, with X, A, Z non-negative matrices.
Goal of NMF: find Ã and Z̃ with ‖X − ÃZ̃‖ ≤ ε.
In NMF, the pure variable assumption is needed for:
Identifiability of A when E = 0 (Donoho and Stodden, 2007).
Identifiability in topic models (count data), Arora et al. (2013): the column sums of X and A are 1, and E = 0.
Polynomial-time NMF algorithms: Arora et al. (2012, 2013); Bittorf et al. (2013). Other restrictions on the matrices are needed.

What can you do with LOVE? All you need is LOVE:
1. A flexible, identifiable latent factor model for overlapping clustering: no restrictions on X and Z.
2. New in the clustering literature: A has both + and − entries.
3. New: A and the clusters are identifiable in the presence of non-ignorable noise E.
4. New algorithm: LOVE, which runs in O(p² + pK) time.
5. New: statistical guarantees for data generated from X = AZ + E, with X sub-Gaussian; immediate extensions to the Gaussian copula.
6. New: an A with both + and − entries allows for a more refined cluster interpretation.

Overlapping Variable Clustering with Statistical Guarantees (2017); F. Bunea, Y. Ning, M. Wegkamp. https://arxiv.org/abs/1704.06977 [Old version; new version coming soon!]
Minimax Optimal Variable Clustering in G-models via Cord (2016); F. Bunea, C. Giraud, X. Luo. https://arxiv.org/abs/1508.01939 [Non-overlapping clustering]
PECOK: a convex optimization approach to variable clustering (2016); F. Bunea, C. Giraud, M. Royer, N. Verzelen. https://arxiv.org/abs/1606.05100 [Non-overlapping clustering]

Thanks!