OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS. Philippe Rigollet (joint with Quentin Berthet)


High dimensional data: a cloud of $n$ points in $\mathbb{R}^p$.

Principal component = direction of largest variance.

Principal component analysis (PCA). A tool for dimension reduction, based on the spectrum of the covariance matrix; the main tool for exploratory data analysis. We study only the first principal component. This talk: high-dimensional $p \gg n$, finite-sample framework.

Testing for sphericity under a rank-one alternative: $H_0: \Sigma = I_p$ versus $H_1: \Sigma = I_p + \theta vv^\top$, $\|v\|_2 = 1$. Isotropic versus principal component.

The model. Observations: i.i.d. $X_1, \dots, X_n \sim \mathcal{N}_p(0, \Sigma)$. Estimator: the empirical covariance matrix $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$. If $n \gg p$ it is a consistent estimator. If $n \simeq c\,p$ it is inconsistent (Nadler, Paul, Onatski, ...): the empirical and true leading eigenvectors are asymptotically orthogonal.
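As a concrete illustration (our own sketch, not from the talk; all parameter values are arbitrary), the model and the empirical covariance matrix in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k, theta = 200, 100, 10, 2.0   # illustrative values only

# A k-sparse unit-norm direction v
v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)

Sigma0 = np.eye(p)                           # null: Sigma = I_p
Sigma1 = np.eye(p) + theta * np.outer(v, v)  # alternative: Sigma = I_p + theta * v v^T

def empirical_cov(Sigma):
    """Draw X_1, ..., X_n i.i.d. N_p(0, Sigma) and return the empirical covariance."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X.T @ X / n

Sigma_hat0 = empirical_cov(Sigma0)
Sigma_hat1 = empirical_cov(Sigma1)
```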

Empirical spectrum under the null $H_0: \Sigma = I_p$. [Figure: histogram of the spectrum of $\hat\Sigma$, matching the Marcenko-Pastur distribution.]

Empirical spectrum under the alternative $H_1: \Sigma = I_p + \theta vv^\top$, $\|v\|_2 = 1$. The BBP (Baik, Ben Arous, Péché) transition, for $p/n \to \gamma > 0$: if $\theta \le \sqrt{p/n}$, the spectrum is indistinguishable from the null; detection is possible if $\theta > \sqrt{p/n}$, a very strong signal!
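A quick numerical check of the BBP phenomenon (again our own sketch with arbitrary parameters): below $\sqrt{p/n}$ the top empirical eigenvalue stays near the Marchenko-Pastur edge $(1 + \sqrt{p/n})^2$; well above it, a spike separates from the bulk.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 100
v = np.ones(p) / np.sqrt(p)              # any unit vector; sparsity plays no role here
mp_edge = (1 + np.sqrt(p / n)) ** 2      # right edge of the Marchenko-Pastur bulk

for theta in (0.5 * np.sqrt(p / n), 3.0 * np.sqrt(p / n)):
    Sigma = np.eye(p) + theta * np.outer(v, v)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    top = np.linalg.eigvalsh(X.T @ X / n)[-1]
    print(f"theta = {theta:.2f}: top eigenvalue {top:.2f}, MP edge {mp_edge:.2f}")
```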

Testing for a sparse principal component: $H_0: \Sigma = I_p$ versus $H_1: \Sigma = I_p + \theta vv^\top$, $\|v\|_2 = 1$, $\|v\|_0 \le k$. Isotropic versus sparse principal direction.

What is the minimum detection level $\theta$? Goal: find a statistic $\varphi : \mathcal{S}_p^+ \to \mathbb{R}$ and thresholds $\tau_0 < \tau_1$ such that $\varphi(\hat\Sigma)$ is small under $H_0$, $P_{H_0}(\varphi(\hat\Sigma) < \tau_0) \ge 1 - \delta$, and large under $H_1$, $P_{H_1}(\varphi(\hat\Sigma) > \tau_1) \ge 1 - \delta$.

For any $\tau_0 \le \tau \le \tau_1$, take the test $\psi(\hat\Sigma) = \mathbb{1}\{\varphi(\hat\Sigma) > \tau\}$. It satisfies $P_{H_0}(\psi = 1) \vee \max_{\|v\|_2 = 1,\ \|v\|_0 \le k} P_{H_1}(\psi = 0) \le \delta$.

Sparse eigenvalue. The $k$-sparse eigenvalue is $\varphi(\hat\Sigma) = \lambda_{\max}^k(\hat\Sigma) = \max_{\|x\|_2 = 1,\ \|x\|_0 \le k} x^\top \hat\Sigma x = \max_{|S| = k} \lambda_{\max}(\hat\Sigma_S)$. Note that $\lambda_{\max}^k(I_p) = 1$ and $\lambda_{\max}^k(I_p + \theta vv^\top) = 1 + \theta$. It has smaller fluctuations than the largest eigenvalue $\lambda_{\max}(\hat\Sigma)$.
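A brute-force sketch of $\lambda_{\max}^k$ (our own illustration; it enumerates all $\binom{p}{k}$ supports, so it is only usable for small $p$ and $k$, which is exactly the computational issue discussed below):

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalue(Sigma_hat, k):
    """lambda^k_max: the largest eigenvalue over all k-by-k principal submatrices."""
    p = Sigma_hat.shape[0]
    best = -np.inf
    for S in combinations(range(p), k):
        best = max(best, np.linalg.eigvalsh(Sigma_hat[np.ix_(S, S)])[-1])
    return best
```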

Upper bounds, with probability $1 - \delta$. Under the null hypothesis: $\lambda_{\max}^k(\hat\Sigma) \le 1 + 8\sqrt{\frac{k\log(9ep/k) + \log(1/\delta)}{n}} =: \tau_0$. Under the alternative hypothesis: $\lambda_{\max}^k(\hat\Sigma) \ge 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$. Can detect as soon as $\tau_0 < \tau_1$, which yields $\theta \ge C\sqrt{\frac{k\log(p/k)}{n}}$.
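As a sanity check of the detection condition $\tau_0 < \tau_1$, a sketch that plugs illustrative numbers into the two bounds above ($\delta$ and all parameter values are arbitrary choices):

```python
import numpy as np

def detection_thresholds(p, n, k, theta, delta=0.05):
    """tau_0: high-probability upper bound on lambda^k_max under H0;
    tau_1: high-probability lower bound under H1 (constants as in the bounds above)."""
    tau0 = 1 + 8 * np.sqrt((k * np.log(9 * np.e * p / k) + np.log(1 / delta)) / n)
    tau1 = 1 + theta - 2 * (1 + theta) * np.sqrt(np.log(1 / delta) / n)
    return tau0, tau1

tau0, tau1 = detection_thresholds(p=200, n=10000, k=10, theta=2.0)
print(tau0, tau1, "detectable" if tau0 < tau1 else "not separated")
```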

Minimax lower bound. Fix $\delta > 0$ (small). Then there exists a constant $C > 0$ such that if $\theta < \theta^* := \sqrt{\frac{k \log(Cp/k^2 + 1)}{n}} \wedge \frac{1}{\sqrt{2}}$, then $\inf_\psi \big\{ P_0^n(\psi = 1) \vee \max_{\|v\|_2 = 1,\ \|v\|_0 \le k} P_v^n(\psi = 0) \big\} \ge \frac{1}{2} - \delta$. See also Arias-Castro, Bubeck and Lugosi (2012).

Computational issues. To compute $\lambda_{\max}^k(\hat\Sigma)$, one needs to compute $\binom{p}{k}$ eigenvalues. Computing it can be used to find cliques in graphs: an NP-complete problem. We need an approximation...

Semidefinite relaxation 101. Write $\lambda_{\max}^k(A) = \max\, x^\top A x$ subject to $\|x\|_2 = 1$, $\|x\|_0 \le k$. With $Z = xx^\top$ this becomes: maximize $\mathrm{Tr}(AZ)$ subject to $\mathrm{Tr}(Z) = 1$, $\mathrm{rank}(Z) = 1$, $Z \succeq 0$, and, by Cauchy-Schwarz, $\|xx^\top\|_1 \le k$. Dropping the rank constraint yields the semidefinite program (SDP) $\mathrm{SDP}_k(A) = \max\{\mathrm{Tr}(AZ) : \mathrm{Tr}(Z) = 1,\ \|Z\|_1 \le k,\ Z \succeq 0\}$, introduced by d'Aspremont, El Ghaoui, Jordan and Lanckriet (2004). Testing procedure: $\mathbb{1}\{\mathrm{SDP}_k(\hat\Sigma) > \tau\}$, which is defined even if the solution of the SDP has rank $> 1$.
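A sketch of $\mathrm{SDP}_k$ using cvxpy (cvxpy is not part of the talk; the default solver and the function name are our own choices):

```python
import numpy as np
import cvxpy as cp

def sdp_k(Sigma_hat, k):
    """SDP relaxation of the k-sparse eigenvalue:
    maximize Tr(Sigma_hat Z)  s.t.  Tr(Z) = 1,  sum_ij |Z_ij| <= k,  Z PSD."""
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), PSD=True)
    problem = cp.Problem(
        cp.Maximize(cp.trace(Sigma_hat @ Z)),
        [cp.trace(Z) == 1, cp.sum(cp.abs(Z)) <= k],
    )
    problem.solve()
    return problem.value

# Testing procedure: reject H0 when sdp_k(Sigma_hat, k) exceeds a calibrated threshold tau.
```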

Performance of SDP. For the alternative: it is a relaxation of $\lambda_{\max}^k(\hat\Sigma)$, so $\mathrm{SDP}_k(\hat\Sigma) \ge \lambda_{\max}^k(\hat\Sigma)$. For the null: use the dual (Bach et al. 2010), $\mathrm{SDP}_k(A) = \min_{U \in \mathcal{S}_p^+} \{\lambda_{\max}(A + U) + k\|U\|_\infty\}$; for any $U \in \mathcal{S}_p^+$ this gives an upper bound on $\mathrm{SDP}_k(\hat\Sigma)$. It is enough to look only at the minimum dual perturbation $\mathrm{MDP}_k(\hat\Sigma) = \min_{z \ge 0} \{\lambda_{\max}(\mathrm{ST}_z(\hat\Sigma)) + kz\}$, where $\mathrm{ST}_z$ denotes entrywise soft-thresholding at level $z$.
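The minimum dual perturbation only requires entrywise soft-thresholding and one top-eigenvalue computation per candidate $z$, so it is cheap to evaluate. A minimal sketch; the finite grid over $z$ is our own arbitrary discretization:

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding of the symmetric matrix A at level z."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp_k(Sigma_hat, k, num_grid=200):
    """MDP_k(Sigma_hat) ~ min over z >= 0 of lambda_max(ST_z(Sigma_hat)) + k*z,
    approximated over a finite grid of z values."""
    grid = np.linspace(0.0, np.abs(Sigma_hat).max(), num_grid)
    return min(np.linalg.eigvalsh(soft_threshold(Sigma_hat, z))[-1] + k * z for z in grid)
```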

Upper bounds, with probability $1 - \delta$, for $\mathrm{DP} \in \{\mathrm{SDP}, \mathrm{MDP}\}$. Under the null hypothesis: $\mathrm{DP}_k(\hat\Sigma) \le 1 + 10\sqrt{\frac{k^2\log(ep/\delta)}{n}} =: \tau_0$. Under the alternative hypothesis: $\mathrm{DP}_k(\hat\Sigma) \ge 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$. Can detect as soon as $\tau_0 < \tau_1$, which yields $\theta \ge C\sqrt{\frac{k^2\log(p/k)}{n}}$.

[Figure: ratio of the 5% quantile under $H_1$ over the 95% quantile under $H_0$, versus signal strength $\theta$, for $\mathrm{SDP}_k$, $\mathrm{MDP}_k$ and $\lambda_{\max}(\hat\Sigma)$. When this ratio is larger than one, both type I and type II errors are below 5%. From A. d'Aspremont, Soutenance HDR, ENS Cachan, Nov. 2012.]

Summary. No detection for $\theta$ below $\sqrt{\frac{k}{n}\log\frac{p}{k}}$; detection with $\lambda_{\max}^k$ above $\sqrt{\frac{k}{n}\log\frac{p}{k}}$; detection with $\mathrm{DP}_k$ above $\sqrt{\frac{k^2}{n}\log\frac{p}{k}}$. Can we tighten the gap?

Numerical evidence. Fix the type I error at 1% and plot the type II error of $\mathrm{MDP}_k$ for $p \in \{50, 100, 200, 500\}$, $k = \sqrt{p}$. [Figure: type II error versus $\theta$ rescaled by $\sqrt{\frac{k}{n}\log\frac{p}{k}}$ (the minimax optimal scaling, left panel) and by $\sqrt{\frac{k^2}{n}\log\frac{p}{k}}$ (the proved scaling, right panel).]

Random graphs. A random (Erdős-Rényi) graph on $N$ vertices is obtained by drawing each edge independently at random with probability $1/2$. Here $N = 50$; the largest clique is of size $2\log N \approx 7.8$, asymptotically almost surely.

Hidden clique. We can hide a clique (here of size 10) in this graph: choose vertices arbitrarily and draw a clique on them,

then embed it in the original random graph.
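As an illustration (our own sketch; the sizes match the running example of $N = 50$ and a planted clique of size 10), the construction on the adjacency matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
N, clique_size = 50, 10

# Erdos-Renyi graph G(N, 1/2): each edge present independently with probability 1/2
A = np.triu(rng.random((N, N)) < 0.5, 1)
A = (A | A.T).astype(int)

# Plant a clique on an arbitrarily chosen set of vertices
clique = rng.choice(N, size=clique_size, replace=False)
A[np.ix_(clique, clique)] = 1
np.fill_diagonal(A, 0)   # no self-loops
```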

Hidden clique. Question: is there a hidden clique in this graph?

Hidden clique problem. It is believed to be hard to find, or to test for the presence of, a clique in a random graph (Alon, Arora, Feige, Hazan, Krauthgamer, ...; cryptosystems are based on this fact!). Conjecture: it is hard to find cliques of size between $2\log N$ and $\sqrt{N}$ (Alon, Krivelevich, Sudakov '98; Feige and Krauthgamer '00; Dekel et al. '10; Feige and Ron '10; Ames and Vavasis '11). A canonical example of average-case complexity.

Hidden clique problem. It seems related to our problem, but not trivially (the randomness structure is very fragile). Note that all our results extend to sub-Gaussian random variables. Theorem: if we could prove that there exist $C > 0$ and $\alpha \in (1, 2)$ such that under the null hypothesis $\mathrm{SDP}_k(\hat\Sigma) \le 1 + C\sqrt{\frac{k^\alpha \log(ep/\delta)}{n}}$, then this could be used to test for the presence of a clique of size $\mathrm{polylog}(n)\, n^{1/4}$.

Remarks. Unlike usual hardness results, this one applies to only one method (actually two), not to all methods. In progress: we can remove this limitation using bicliques (one needs to deal carefully with independence).

Conclusion. Optimal rates for sparse detection. Computationally efficient methods with a suboptimal rate. A first(?) link between sparse detection and average-case complexity. Opens the door to new statistical lower bounds: complexity-theoretic lower bounds. Evidence that heuristics cannot be optimal.