Overlapping Variable Clustering with Statistical Guarantees and LOVE

Overlapping Variable Clustering with Statistical Guarantees and LOVE
Department of Statistical Science, Cornell University
WHOA-PSI, St. Louis, August 2017

Joint work with Mike Bing, Yang Ning, and Marten Wegkamp (Cornell University, Department of Statistical Science).

Variable clustering
What is variable clustering?
Observable: X = (X_1, ..., X_j, ..., X_p) ∈ R^p, a random vector.
Data: X^(1), ..., X^(n), i.i.d. copies of X ∈ R^p.
Goal of variable clustering: find sub-groups of similar coordinates of X, using the data.
This goal differs from data/point clustering, which finds sub-groups of similar observations X^(i), 1 ≤ i ≤ n.
The data also differ from network clustering, where the data are a 0/1 adjacency matrix.

Co-clustering genes using expression profiles
[Figure: estimated cluster memberships and weights of three genes (ENSG00000272865, ENSG00000273487, ENSG00000273423) across clusters 1-10.]

Model-based overlapping variable clustering
Objectives of model-based (overlapping) variable clustering:
Define a model-based similarity between the coordinates of X.
The model definition depends crucially on what we want to cluster and on the type of data we have: here we cluster variables, and we observe their values.
Use an identifiable model to define clusters of coordinates, allowing for overlap.
Estimate the clusters and assess their accuracy theoretically, within the model-based framework.

A first step towards a model for overlapping clustering
A sparse latent variable model with unstructured sparsity:
1. X = AZ + E; A is a p × K allocation matrix.
2. Z ∈ R^K is a latent vector, E ∈ R^p is a noise vector, and Z is independent of E.
3. A is row sparse: Σ_{k=1}^K |A_jk| ≤ 1 for each j ∈ {1, ..., p}.
Variable similarity and clusters: X_j and X_l are similar if they are connected with the same Z_k.
This suggests a definition of clusters with overlap: G_k := { j ∈ {1, ..., p} : A_jk ≠ 0 }.
Issue: the model and the clusters are not identifiable: AZ = AQQ^T Z for any orthogonal Q, and A_jk may be 0 while (AQ)_jk may not be.

Identifiable models for overlapping clustering
X = AZ + E; X ∈ R^p, Z ∈ R^K (latent), A ∈ R^{p×K}.
A sparse latent variable model with structured sparsity: the pure variable assumption
(i) A is row sparse: Σ_{k=1}^K |A_jk| ≤ 1 for each j ∈ {1, ..., p}.
(ii) For every column k ∈ {1, ..., K}, there exist at least two indices (rows) j ∈ {1, ..., p} such that A_jk = 1 and A_jl = 0 for all l ≠ k.
Spoiler alert! This A is identifiable up to signed permutations.
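
To make the model concrete, here is a minimal simulation sketch in Python/NumPy of data drawn from X = AZ + E under the pure variable assumption. The dimensions, the diagonal C, and the noise level are illustrative choices (loosely mirroring the worked example used later in the talk), not values prescribed by the method.

import numpy as np

rng = np.random.default_rng(0)
p, K, n = 9, 3, 500

# Allocation matrix satisfying the pure variable assumption:
# variables 1-7 are pure (a single entry equal to 1), variables 8-9 are impure.
A = np.zeros((p, K))
A[0:3, 0] = 1.0            # pure variables anchoring cluster G_1
A[3:5, 1] = 1.0            # pure variables anchoring cluster G_2
A[5:7, 2] = 1.0            # pure variables anchoring cluster G_3
A[7] = [0.5, 0.5, 0.0]     # impure: belongs to G_1 and G_2
A[8] = [2/3, 1/6, 1/6]     # impure: belongs to all three clusters

C = np.diag([1.0, 2.0, 3.0])     # Cov(Z); diagonal only for simplicity
Gamma = 0.1 * np.eye(p)          # Cov(E), independent noise

Z = rng.multivariate_normal(np.zeros(K), C, size=n)       # latent factors
E = rng.multivariate_normal(np.zeros(p), Gamma, size=n)   # noise
X = Z @ A.T + E                                           # n x p data matrix, X = A Z + E

Sigma = A @ C @ A.T + Gamma          # population covariance
Sigma_hat = np.cov(X, rowvar=False)  # sample covariance used by the estimators below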

The pure variable assumption: interpretation
Cluster: G_k := { j ∈ {1, ..., p} : A_jk ≠ 0 }.
The pure variable assumption: a pure variable X_j is associated with only one latent factor Z_k.
Pure variables are crucial in building overlapping clusters:
Cluster G_k is given by Z_k, which is not observable; Z_k = (biological) function.
A pure variable X_j is an observable proxy for Z_k: the observable X_j performs function Z_k, and it anchors G_k.

Overlapping clustering: interpretation
An instance of determining unknown functions of variables:
Gene 1 (X_1), with function 1, anchors G_1.
Gene 3 (X_3), with function 2, anchors G_2.
Gene 2 (X_2) ∈ G_1: Gene 2 performs function 1.
Gene 2 (X_2) ∈ G_2: Gene 2 also performs function 2.
Before clustering, Gene 2 had unknown function; after clustering, Gene 2 is found to have a dual function.

Identifiable models for overlapping clustering
Ingredients for identifiability:
A latent variable model X = AZ + E with structure on A: the pure variable assumption.
A mild assumption on C = Cov(Z):
Δ(C) := min_{j ≠ k} ( min{C_jj, C_kk} − C_jk ) > 0.
Δ(C) > 0 implies Z_j ≠ Z_k a.s., for all j ≠ k.
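
As a quick check of the condition, a few lines of Python compute the separation quantity for a hypothetical C (the Δ(C) notation follows the reconstruction of the slide above):

import numpy as np

def Delta(C):
    # Delta(C) = min over j != k of ( min{C_jj, C_kk} - C_jk )
    K = C.shape[0]
    return min(min(C[j, j], C[k, k]) - C[j, k]
               for j in range(K) for k in range(K) if j != k)

C = np.array([[1.0, 0.3],
              [0.3, 2.0]])
print(Delta(C))   # 0.7 > 0: no two coordinates of Z coincide almost surely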

Identifiability in structured sparse latent models
The pure variable set I is identifiable: I and its partition I := {I_k}_{1≤k≤K} can be constructed uniquely from Σ, up to label permutations.
The allocation matrix A is identifiable: under the pure variable assumption, there exists a unique matrix A, up to signed permutations, such that X = AZ + E.
The clusters G = {G_k}_{1≤k≤K} are identifiable: under the pure variable assumption, the overlapping clusters G_k are identifiable, up to label switching.
If pure variables do not exist, identifiability of A fails.

Central challenge in proving identifiability
X = AZ + E implies Σ = ACA^T + Γ.
I = pure variable index set; J = {1, ..., p} \ I = impure variable index set.
Central challenge: how to distinguish between I and J?
Added challenge: how to distinguish between I and J when we do not know the noise covariance Γ?

What is pure and what is impure?
A necessary and sufficient condition for purity
For each 1 ≤ i ≤ p, set
M_i := max_{j ∈ [p]\{i}} Σ_ij,
S_i := { j ∈ [p] \ {i} : Σ_ij = M_i }.
For a given A and its induced pure variable set I, we have
i ∈ I  ⟺  M_i = max_{k ∈ [p]\{j}} Σ_jk for all j ∈ S_i.

Look for maxima in Σ, ignore the diagonal!
I = {{1, 2, 3}, {4, 5}, {6, 7}} and J = {8, 9}.
Σ (diagonal left blank, it is ignored):

    ·    1    1    0    0    0    0  1/2  2/3
    1    ·    1    0    0    0    0  1/2  2/3
    1    1    ·    0    0    0    0  1/2  2/3
    0    0    0    ·    2    0    0    1  1/3
    0    0    0    2    ·    0    0    1  1/3
    0    0    0    0    0    ·    3    0  1/2
    0    0    0    0    0    3    ·    0  1/2
  1/2  1/2  1/2    1    1    0    0    ·  1/6
  2/3  2/3  2/3  1/3  1/3  1/2  1/2  1/6    ·

M_1 = max_{k ≠ 1} Σ_1k = 1.
S_1 = { j ≠ 1 : Σ_1j = 1 } = {2, 3}.
Check: M_1 = max_{k ≠ 2} Σ_2k = 1 and M_1 = max_{k ≠ 3} Σ_3k = 1.
So I_1 = S_1 ∪ {1} = {1, 2, 3} is pure.

Look for maxima in Σ, ignore the diagonal! (same Σ as above)
I = {{1, 2, 3}, {4, 5}, {6, 7}} and J = {8, 9}.
M_8 = max_{k ≠ 8} Σ_8k = 1.
S_8 = { j ≠ 8 : Σ_8j = 1 } = {4, 5}.
But 1 = M_8 ≠ max_{k ≠ 4} Σ_4k = 2.
So 8 cannot be pure: 8 ∈ J.
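
The purity criterion can be verified mechanically on this example. The Python/NumPy sketch below encodes the off-diagonal entries of Σ shown above (the diagonal is ignored by the criterion, so it is set to 0) and applies the characterization; it returns {1, ..., 7} as pure and rejects 8 and 9.

import numpy as np

# Off-diagonal entries of the example Sigma; the diagonal is irrelevant here.
Sigma = np.array([
    [0,   1,   1,   0,   0,   0,   0,   1/2, 2/3],
    [1,   0,   1,   0,   0,   0,   0,   1/2, 2/3],
    [1,   1,   0,   0,   0,   0,   0,   1/2, 2/3],
    [0,   0,   0,   0,   2,   0,   0,   1,   1/3],
    [0,   0,   0,   2,   0,   0,   0,   1,   1/3],
    [0,   0,   0,   0,   0,   0,   3,   0,   1/2],
    [0,   0,   0,   0,   0,   3,   0,   0,   1/2],
    [1/2, 1/2, 1/2, 1,   1,   0,   0,   0,   1/6],
    [2/3, 2/3, 2/3, 1/3, 1/3, 1/2, 1/2, 1/6, 0  ],
])
p = Sigma.shape[0]

def M(i):
    # M_i = max_{j != i} Sigma_ij (off-diagonal row maximum)
    return np.delete(Sigma[i], i).max()

def is_pure(i):
    M_i = M(i)
    S_i = [j for j in range(p) if j != i and Sigma[i, j] == M_i]
    # i is pure  <=>  every j in S_i attains the same off-diagonal maximum M_i
    return all(M(j) == M_i for j in S_i)

print([i + 1 for i in range(p) if is_pure(i)])   # -> [1, 2, 3, 4, 5, 6, 7]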

Estimation
Estimate I.
Estimate A_I, the |I| × K sub-matrix of A with rows in I.
Estimate A_J, with J = {1, ..., p} \ I.
Estimate the clusters G_k.

Estimation of the pure variable set
Reminder: I = pure variable set.
Constructive characterization of I, population version:
For each 1 ≤ i ≤ p, set
M_i := max_{j ∈ [p]\{i}} Σ_ij,
S_i := { j ∈ [p] \ {i} : Σ_ij = M_i }.
For a given A and its induced pure variable set I, we have
i ∈ I  ⟺  M_i = max_{k ∈ [p]\{j}} Σ_jk for all j ∈ S_i.
Moreover, S_i ∪ {i} = I_k for some k, where I := {I_k}_{1≤k≤K} is a partition of I.

Estimation of the pure variable set
Algorithm idea:
Use the constructive characterization of I at the population level.
Replace Σ by the sample covariance Σ̂.
Replace equalities by inequalities, allowing for a tolerance level δ := ‖Σ̂ − Σ‖_∞.
The algorithm has complexity O(p²) and requires the inputs Σ̂ and δ.
The algorithm returns Î, its partition Î = {Î_k}, and therefore K̂.
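
A minimal Python sketch of this idea is below. It is not the authors' algorithm: the exact way the tolerance enters, the handling of signs, and the tie-breaking all differ in the paper; here each population equality is simply relaxed to a 2δ-tolerant inequality, with δ treated as a tuning parameter.

import numpy as np

def estimate_pure_partition(Sigma_hat, delta):
    # Sketch only: relax the population characterization of I by a 2*delta slack.
    p = Sigma_hat.shape[0]
    groups, assigned = [], set()

    def M(i):
        # off-diagonal row maximum of Sigma_hat
        return np.delete(Sigma_hat[i], i).max()

    for i in range(p):
        if i in assigned:
            continue
        M_i = M(i)
        # near-maximizers play the role of S_i
        S_i = [j for j in range(p) if j != i and Sigma_hat[i, j] >= M_i - 2 * delta]
        if all(abs(M(j) - M_i) <= 2 * delta for j in S_i):
            group = [i] + S_i          # plays the role of I_k = S_i union {i}
            groups.append(group)
            assigned.update(group)
    return groups                      # estimated partition; len(groups) estimates K

# Usage on the simulated data from the earlier sketch:
# partition_hat = estimate_pure_partition(Sigma_hat, delta=0.1)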

Estimation of the allocation sub-matrix A_I

A = [ A_I ]
    [ A_J ]

Estimated previously: Î and its partition Î = {Î_1, ..., Î_K̂}.
The estimator Â_Î has rows i ∈ Î consisting of K̂ − 1 zeros and one entry equal to either +1 or −1.
Signs will be determined up to signed permutations.
(1) Pick i ∈ Î_k and pick a sign for Â_ik, say Â_ik = 1.
(2) For any j ∈ Î_k \ {i}, set Â_jk = +1 if Σ̂_ij > 0, and Â_jk = −1 if Σ̂_ij < 0.
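
In code, the two-step sign rule above looks roughly as follows (a sketch; the anchor choice within each group is arbitrary, reflecting that signs are only identified up to a signed permutation):

import numpy as np

def estimate_A_pure(Sigma_hat, partition):
    # partition: estimated pure groups I_hat_1, ..., I_hat_K (lists of indices).
    # Returns {i: row of A_hat} for the pure variables; the first index of each
    # group is taken as the +1 anchor.
    K = len(partition)
    rows = {}
    for k, group in enumerate(partition):
        anchor = group[0]
        rows[anchor] = np.eye(K)[k]                    # A_hat[anchor, k] = +1
        for j in group[1:]:
            row = np.zeros(K)
            row[k] = 1.0 if Sigma_hat[anchor, j] > 0 else -1.0
            rows[j] = row
    return rows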

Estimation of the allocation sub-matrix A_J

Σ = [ Σ_II  Σ_IJ ]   [ A_I C A_I^T   A_I C A_J^T ]   [ Γ_II    0   ]
    [ Σ_JI  Σ_JJ ] = [ A_J C A_I^T   A_J C A_J^T ] + [   0   Γ_JJ  ]

Estimate A_J row by row; motivation:
Σ_IJ = A_I C A_J^T  ⟹  θ_j = C A_j for each j ∈ J,
where θ_j ∈ R^K and C are recovered from Σ via the pure groups:
C_kk := (1 / (|I_k| (|I_k| − 1))) Σ_{i, j ∈ I_k, i ≠ j} Σ_ij,
C_km := (1 / (|I_k| |I_m|)) Σ_{i ∈ I_k, j ∈ I_m} Σ_ij,
θ_j^k := (1 / |I_k|) Σ_{i ∈ I_k} A_ik Σ_ij.
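
Plugging Σ̂ and the estimated pure partition into these identities gives the estimators used in the next step. The sketch below follows the displayed formulas literally; in particular, sign handling beyond the θ̂_j averages is omitted, which suffices when the pure rows of A are non-negative as in the running example.

import numpy as np

def estimate_C_and_theta(Sigma_hat, partition, A_pure_rows):
    # A_pure_rows: dict from the sign step; A_pure_rows[i][k] is +1 or -1 for i in I_hat_k.
    K = len(partition)
    C_hat = np.zeros((K, K))
    for k, Ik in enumerate(partition):
        # C_kk: average of Sigma_hat over distinct pairs inside I_hat_k
        C_hat[k, k] = np.mean([Sigma_hat[i, j] for i in Ik for j in Ik if i != j])
        for m, Im in enumerate(partition):
            if m != k:
                # C_km: average of Sigma_hat over pairs across I_hat_k and I_hat_m
                C_hat[k, m] = np.mean([Sigma_hat[i, j] for i in Ik for j in Im])

    def theta_hat(j):
        # theta_hat_j[k] = (1/|I_k|) * sum_{i in I_k} A_ik * Sigma_hat[i, j]
        return np.array([np.mean([A_pure_rows[i][k] * Sigma_hat[i, j] for i in Ik])
                         for k, Ik in enumerate(partition)])

    return C_hat, theta_hat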

Estimation of the allocation sub-matrix A_J
Estimation of the rows of A_J:
Under the model: θ_j = C A_j, with A_j sparse.
Available: Σ̂ and the estimated partition of pure variables Î.
Use Σ̂ and Î to construct θ̂_j ≈ θ_j and Ĉ ≈ C.
Many choices are available to estimate a sparse A_j. Dantzig-type estimator: minimize ‖β‖_1 over β ∈ R^K̂ such that ‖θ̂_j − Ĉβ‖_∞ ≤ 2δ.
Repeat for each j ∈ Ĵ := {1, ..., p} \ Î to obtain Â_Ĵ.
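
The Dantzig-type program is a linear program once β is split into positive and negative parts. Below is a sketch using scipy.optimize.linprog; the sup-norm constraint and the 2δ radius follow the reconstruction above.

import numpy as np
from scipy.optimize import linprog

def dantzig_row(theta_hat_j, C_hat, delta):
    # minimize ||beta||_1  subject to  ||theta_hat_j - C_hat @ beta||_inf <= 2*delta,
    # written as an LP in (beta_plus, beta_minus) >= 0 with beta = beta_plus - beta_minus.
    K = C_hat.shape[1]
    cost = np.ones(2 * K)
    A_ub = np.vstack([
        np.hstack([ C_hat, -C_hat]),   #  C beta <=  theta + 2*delta
        np.hstack([-C_hat,  C_hat]),   # -C beta <= -theta + 2*delta
    ])
    b_ub = np.concatenate([theta_hat_j + 2 * delta, -theta_hat_j + 2 * delta])
    res = linprog(cost, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * (2 * K))
    beta_plus, beta_minus = res.x[:K], res.x[K:]
    return beta_plus - beta_minus      # estimated row A_hat_j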

Statistical guarantees: assumptions
Recall: Σ = ACA^T + Γ; X sub-Gaussian; ‖Σ̂ − Σ‖_∞ =: δ = O(√((log p)/n)).
Signal strength conditions:
1. Either on C: Δ(C) = min_{j ≠ k} ( min{C_jj, C_kk} − C_jk ) ≥ c δ.
2. Or on A: the smallest non-zero entry of A is larger than δ.

Estimation of the pure variable set: guarantees
I = pure variables; J = {1, ..., p} \ I.
Quasi-pure variables: for each k ∈ [K],
J_1^k = { i ∈ J : A_ik ≥ 1 − 4δ/τ² },   J_1 = ∪_{k=1}^K J_1^k.
If X_1 is pure then A_1k = 1 for some k ∈ [K].
If X_2 is quasi-pure then A_2m ≈ 1 for some m ∈ [K].

Estimation of the pure variable set: guarantees
J_1^k = { i : A_ik ≈ 1 };   J_1 := ∪_{k=1}^K J_1^k;   I_k = { i : A_ik = 1 }.
Recovery guarantees, with no signal strength conditions on A:
(a) K̂ = K.
(b) I ⊆ Î ⊆ I ∪ J_1.
(c) I_k ⊆ Î_k ⊆ I_k ∪ J_1^k,
each holding w.h.p., for each k ∈ [K].
Minimal recovery mistakes, with no conditions on A:
Pure        (1, 0, 0, 0, 0, 0)                    In, correct.
Quasi-pure  (0.99, 0.01, 0, 0, 0, 0)              In, slight mistake.
Impure      (0.25, 0.25, 0.001, 0.099, 0.2, 0.2)  Out, correct.

Estimation of the pure variable set: guarantees
Exact recovery, under conditions on A:
Î = I, up to label switching, with I = ∪_{a=1}^K I_a.
Exact recovery: conditions on A:
Pure        (1, 0, 0, 0, 0, 0)                    In, correct.
Quasi-pure  (0.99, 0.01, 0, 0, 0, 0)              Not allowed.
Impure I    (0.25, 0.25, 0.001, 0.099, 0.2, 0.2)  Not allowed.
Impure II   (0.25, 0.25, 0.1, 0.1, 0.3, 0.2)      Out, correct.

Estimation of the allocation matrix A: guarantees
Sup-norm consistency: let H denote the set of all K × K signed permutation matrices. With probability exceeding 1 − c_1 p^{−c_2},
1. K̂ = K.
2. min_{P ∈ H} ‖Â − PA‖_∞ ≲ κ √((log p)/n), with κ := ‖C^{−1}‖_{∞,1}.
This is a non-standard bound, similar to errors-in-variables model bounds.
If C is diagonally dominant, then κ is constant.

Activation and inhibition
[Slide: a worked numerical example displaying an allocation matrix A with entries of both signs alongside its estimate Â.]
Care is needed in interpreting the signs: for each latent factor Z_k we can consistently determine which of the X_j's are associated with Z_k in the same direction, but not the direction itself.

Estimation of the overlapping groups
Ĝ = { Ĝ_1, ..., Ĝ_K̂ },   Ĝ_k = { i : Â_ik ≠ 0 }.
FPR = ( Σ_{i,k} 1{A_ik = 0, Â_ik ≠ 0} ) / ( Σ_{i,k} 1{A_ik = 0} ),
FNR = ( Σ_{i,k} 1{A_ik ≠ 0, Â_ik = 0} ) / ( Σ_{i,k} 1{A_ik ≠ 0} ).
Guarantees for cluster recovery (all results hold w.h.p.):
Under conditions on C: K̂ = K; FPR = 0; FNR ≤ β.
Under conditions on A: K̂ = K; FPR = 0; FNR = 0; Ĝ = G.
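
These rates are straightforward to compute once the columns of Â have been aligned with those of A (for example, by the best signed permutation); the sketch below assumes the columns are already aligned.

import numpy as np

def cluster_recovery_rates(A_hat, A, tol=0.0):
    # Columns of A_hat are assumed already matched to those of A.
    est = np.abs(A_hat) > tol      # estimated supports: i in G_hat_k iff A_hat[i, k] != 0
    true = A != 0                  # true supports:      i in G_k     iff A[i, k] != 0
    G_hat = [list(np.flatnonzero(est[:, k])) for k in range(A.shape[1])]
    fpr = (est & ~true).sum() / max((~true).sum(), 1)   # false positive rate
    fnr = (~est & true).sum() / max(true.sum(), 1)      # false negative rate
    return G_hat, fpr, fnr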

Sparsity per row: s_j = ‖A_j‖_0 = Σ_k 1{A_jk ≠ 0}.
J_1 = quasi-pure variables.
J_2 = variables associated with some Z's below the noise level.
J_3 = variables associated with all Z's above the noise level.
FNR ≤ β := ( Σ_{j ∈ J_1 ∪ J_2} s_j ) / ( Σ_{j ∈ J_1 ∪ J_2} s_j + Σ_{j ∈ J_3 ∪ I} s_j ).
If |J_3| + |I| >> |J_1| + |J_2|, then β is very small.

LOVE
A Latent model approach to OVErlapping clustering:
Estimate the partition I of pure variables by Î.
Estimate A_I and A_J separately to obtain Â, the allocation matrix estimate.
Estimate the overlapping clusters by Ĝ.
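
Putting the previous sketches together gives a rough end-to-end picture of the pipeline. The helper names (estimate_pure_partition, estimate_A_pure, estimate_C_and_theta, dantzig_row) are this transcript's own illustrations from the earlier slides, not the authors' code, and δ is treated as a tuning parameter.

import numpy as np

def love(X, delta):
    # Sketch of the LOVE pipeline, assembled from the helper sketches above.
    n, p = X.shape
    Sigma_hat = np.cov(X, rowvar=False)

    partition = estimate_pure_partition(Sigma_hat, delta)    # I_hat and its partition
    K_hat = len(partition)                                   # estimated number of clusters
    A_pure = estimate_A_pure(Sigma_hat, partition)           # rows of A_hat for pure variables
    C_hat, theta_hat = estimate_C_and_theta(Sigma_hat, partition, A_pure)

    pure_idx = {i for group in partition for i in group}
    A_hat = np.zeros((p, K_hat))
    for i in pure_idx:
        A_hat[i] = A_pure[i]
    for j in range(p):
        if j not in pure_idx:
            A_hat[j] = dantzig_row(theta_hat(j), C_hat, delta)   # impure rows, Dantzig step

    clusters = [list(np.flatnonzero(A_hat[:, k])) for k in range(K_hat)]   # G_hat_k
    return A_hat, clusters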

Co-clustering genes using expression profiles: p = 500
Benchmark data set: RNA-seq transcript-level data; blood platelet samples from n = 285 individuals.
ENSG00000273487 and ENSG00000272865, both non-coding RNAs, are placed together in Cluster 4; each is also placed in other clusters.
Non-coding RNAs are pleiotropic (they have multiple functions).
[Figure: estimated cluster memberships and weights of ENSG00000272865, ENSG00000273487, and ENSG00000273423 across clusters 1-10.]

Related work
There is a large literature on Non-negative Matrix Factorization (NMF): X = AZ + E, with X, A, Z non-negative matrices.
Goal of NMF: find Ã and Z̃ with ‖X − ÃZ̃‖ ≤ ε.
In NMF, the pure variable assumption is needed for:
Identifiability of A when E = 0 (Donoho and Stodden, 2007).
Identifiability in topic models (count data), Arora et al. (2013): the column sums of X and A are 1, and E = 0.
Polynomial-time NMF algorithms: Arora et al. (2012, 2013); Bittorf et al. (2013). Other restrictions on the matrices are needed.

What can you do with LOVE? All you need is LOVE:
1. A flexible, identifiable latent factor model for overlapping clustering: no restrictions on X and Z.
2. New in the clustering literature: A has both + and − entries.
3. New: A and the clusters are identifiable in the presence of non-ignorable noise E.
4. New algorithm: LOVE, which runs in O(p² + pK) time.
5. New: statistical guarantees for data generated from X = AZ + E, with X sub-Gaussian; immediate extensions to the Gaussian copula.
6. New: an A with both + and − entries allows for a more refined cluster interpretation.

Overlapping Variable Clustering with Statistical Guarantees (2017); F. Bunea, Y. Ning, M. Wegkamp. https://arxiv.org/abs/1704.06977 [Old version; new version coming soon!]
Minimax Optimal Variable Clustering in G-models via Cord (2016); F. Bunea, C. Giraud, X. Luo. https://arxiv.org/abs/1508.01939 [Non-overlapping clustering]
PECOK: a convex optimization approach to variable clustering (2016); F. Bunea, C. Giraud, M. Royer, N. Verzelen. https://arxiv.org/abs/1606.05100 [Non-overlapping clustering]

Thanks!