OPTIMAL DETECTION OF SPARSE PRINCIPAL COMPONENTS. Philippe Rigollet (joint with Quentin Berthet)


High dimensional data: a cloud of $n$ points in $\mathbb{R}^p$.

Principal component = direction of largest variance.

Principal component analysis (PCA). A tool for dimension reduction, based on the spectrum of the covariance matrix; the main tool for exploratory data analysis. We study only the first principal component. This talk: high-dimensional $p \gg n$, finite-sample framework.

Testing for sphericity under a rank-one alternative: $H_0: \Sigma = I_p$ versus $H_1: \Sigma = I_p + \theta vv^\top$, $\|v\|_2 = 1$. Isotropic versus principal component.

The model. Observations: i.i.d. $X_1, \dots, X_n \sim \mathcal{N}_p(0, \Sigma)$. Estimator: the empirical covariance matrix $\hat\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$. If $n \gg p$ it is a consistent estimator. If $n \simeq c\,p$ it is inconsistent (Nadler, Paul, Onatski, ...): the empirical and true leading eigenvectors are asymptotically orthogonal.
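As a concrete illustration (our own sketch, not from the talk; all parameter values are arbitrary), the model and the empirical covariance matrix in NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, k, theta = 200, 100, 10, 2.0   # illustrative values only

# A k-sparse unit-norm direction v
v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)

Sigma0 = np.eye(p)                           # null: Sigma = I_p
Sigma1 = np.eye(p) + theta * np.outer(v, v)  # alternative: Sigma = I_p + theta * v v^T

def empirical_cov(Sigma):
    """Draw X_1, ..., X_n i.i.d. N_p(0, Sigma) and return the empirical covariance."""
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    return X.T @ X / n

Sigma_hat0 = empirical_cov(Sigma0)
Sigma_hat1 = empirical_cov(Sigma1)
```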

Empirical spectrum under the null $H_0: \Sigma = I_p$. [Figure: histogram of the spectrum of $\hat\Sigma$, matching the Marcenko-Pastur distribution.]

Empirical spectrum under the alternative $H_1: \Sigma = I_p + \theta vv^\top$, $\|v\|_2 = 1$. The BBP (Baik, Ben Arous, Péché) transition, for $p/n \to \gamma > 0$: if $\theta \le \sqrt{p/n}$, the spectrum is indistinguishable from the null; detection is possible if $\theta > \sqrt{p/n}$, a very strong signal!
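A quick numerical check of the BBP phenomenon (again our own sketch with arbitrary parameters): below $\sqrt{p/n}$ the top empirical eigenvalue stays near the Marchenko-Pastur edge $(1 + \sqrt{p/n})^2$; well above it, a spike separates from the bulk.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 200, 100
v = np.ones(p) / np.sqrt(p)              # any unit vector; sparsity plays no role here
mp_edge = (1 + np.sqrt(p / n)) ** 2      # right edge of the Marchenko-Pastur bulk

for theta in (0.5 * np.sqrt(p / n), 3.0 * np.sqrt(p / n)):
    Sigma = np.eye(p) + theta * np.outer(v, v)
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    top = np.linalg.eigvalsh(X.T @ X / n)[-1]
    print(f"theta = {theta:.2f}: top eigenvalue {top:.2f}, MP edge {mp_edge:.2f}")
```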

Testing for a sparse principal component: $H_0: \Sigma = I_p$ versus $H_1: \Sigma = I_p + \theta vv^\top$, $\|v\|_2 = 1$, $\|v\|_0 \le k$. Isotropic versus sparse principal direction.

What is the minimum detection level $\theta$? Goal: find a statistic $\varphi : \mathcal{S}_p^+ \to \mathbb{R}$ and thresholds $\tau_0 < \tau_1$ such that $\varphi(\hat\Sigma)$ is small under $H_0$, $P_{H_0}(\varphi(\hat\Sigma) < \tau_0) \ge 1 - \delta$, and large under $H_1$, $P_{H_1}(\varphi(\hat\Sigma) > \tau_1) \ge 1 - \delta$.

For any $\tau_0 \le \tau \le \tau_1$, take the test $\psi(\hat\Sigma) = \mathbb{1}\{\varphi(\hat\Sigma) > \tau\}$. It satisfies $P_{H_0}(\psi = 1) \vee \max_{\|v\|_2 = 1,\ \|v\|_0 \le k} P_{H_1}(\psi = 0) \le \delta$.

Sparse eigenvalue. The $k$-sparse eigenvalue is $\varphi(\hat\Sigma) = \lambda_{\max}^k(\hat\Sigma) = \max_{\|x\|_2 = 1,\ \|x\|_0 \le k} x^\top \hat\Sigma x = \max_{|S| = k} \lambda_{\max}(\hat\Sigma_S)$. Note that $\lambda_{\max}^k(I_p) = 1$ and $\lambda_{\max}^k(I_p + \theta vv^\top) = 1 + \theta$. It has smaller fluctuations than the largest eigenvalue $\lambda_{\max}(\hat\Sigma)$.
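A brute-force sketch of $\lambda_{\max}^k$ (our own illustration; it enumerates all $\binom{p}{k}$ supports, so it is only usable for small $p$ and $k$, which is exactly the computational issue discussed below):

```python
import numpy as np
from itertools import combinations

def sparse_eigenvalue(Sigma_hat, k):
    """lambda^k_max: the largest eigenvalue over all k-by-k principal submatrices."""
    p = Sigma_hat.shape[0]
    best = -np.inf
    for S in combinations(range(p), k):
        best = max(best, np.linalg.eigvalsh(Sigma_hat[np.ix_(S, S)])[-1])
    return best
```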

Upper bounds, with probability $1 - \delta$. Under the null hypothesis: $\lambda_{\max}^k(\hat\Sigma) \le 1 + 8\sqrt{\frac{k\log(9ep/k) + \log(1/\delta)}{n}} =: \tau_0$. Under the alternative hypothesis: $\lambda_{\max}^k(\hat\Sigma) \ge 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$. Can detect as soon as $\tau_0 < \tau_1$, which yields $\theta \ge C\sqrt{\frac{k\log(p/k)}{n}}$.
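As a sanity check of the detection condition $\tau_0 < \tau_1$, a sketch that plugs illustrative numbers into the two bounds above ($\delta$ and all parameter values are arbitrary choices):

```python
import numpy as np

def detection_thresholds(p, n, k, theta, delta=0.05):
    """tau_0: high-probability upper bound on lambda^k_max under H0;
    tau_1: high-probability lower bound under H1 (constants as in the bounds above)."""
    tau0 = 1 + 8 * np.sqrt((k * np.log(9 * np.e * p / k) + np.log(1 / delta)) / n)
    tau1 = 1 + theta - 2 * (1 + theta) * np.sqrt(np.log(1 / delta) / n)
    return tau0, tau1

tau0, tau1 = detection_thresholds(p=200, n=10000, k=10, theta=2.0)
print(tau0, tau1, "detectable" if tau0 < tau1 else "not separated")
```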

Minimax lower bound. Fix $\delta > 0$ (small). Then there exists a constant $C > 0$ such that if $\theta < \theta^* := \sqrt{\frac{k \log(Cp/k^2 + 1)}{n}} \wedge \frac{1}{\sqrt{2}}$, then $\inf_\psi \big\{ P_0^n(\psi = 1) \vee \max_{\|v\|_2 = 1,\ \|v\|_0 \le k} P_v^n(\psi = 0) \big\} \ge \frac{1}{2} - \delta$. See also Arias-Castro, Bubeck and Lugosi (2012).

Computational issues. To compute $\lambda_{\max}^k(\hat\Sigma)$, one needs to compute $\binom{p}{k}$ eigenvalues. Computing it can be used to find cliques in graphs: an NP-complete problem. We need an approximation...

Semidefinite relaxation 101. Write $\lambda_{\max}^k(A) = \max\, x^\top A x$ subject to $\|x\|_2 = 1$, $\|x\|_0 \le k$. With $Z = xx^\top$ this becomes: maximize $\mathrm{Tr}(AZ)$ subject to $\mathrm{Tr}(Z) = 1$, $\mathrm{rank}(Z) = 1$, $Z \succeq 0$, and, by Cauchy-Schwarz, $\|xx^\top\|_1 \le k$. Dropping the rank constraint yields the semidefinite program (SDP) $\mathrm{SDP}_k(A) = \max\{\mathrm{Tr}(AZ) : \mathrm{Tr}(Z) = 1,\ \|Z\|_1 \le k,\ Z \succeq 0\}$, introduced by d'Aspremont, El Ghaoui, Jordan and Lanckriet (2004). Testing procedure: $\mathbb{1}\{\mathrm{SDP}_k(\hat\Sigma) > \tau\}$, which is defined even if the solution of the SDP has rank $> 1$.
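A sketch of $\mathrm{SDP}_k$ using cvxpy (cvxpy is not part of the talk; the default solver and the function name are our own choices):

```python
import numpy as np
import cvxpy as cp

def sdp_k(Sigma_hat, k):
    """SDP relaxation of the k-sparse eigenvalue:
    maximize Tr(Sigma_hat Z)  s.t.  Tr(Z) = 1,  sum_ij |Z_ij| <= k,  Z PSD."""
    p = Sigma_hat.shape[0]
    Z = cp.Variable((p, p), PSD=True)
    problem = cp.Problem(
        cp.Maximize(cp.trace(Sigma_hat @ Z)),
        [cp.trace(Z) == 1, cp.sum(cp.abs(Z)) <= k],
    )
    problem.solve()
    return problem.value

# Testing procedure: reject H0 when sdp_k(Sigma_hat, k) exceeds a calibrated threshold tau.
```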

Performance of SDP. For the alternative: it is a relaxation of $\lambda_{\max}^k(\hat\Sigma)$, so $\mathrm{SDP}_k(\hat\Sigma) \ge \lambda_{\max}^k(\hat\Sigma)$. For the null: use the dual (Bach et al. 2010), $\mathrm{SDP}_k(A) = \min_{U \in \mathcal{S}_p^+} \{\lambda_{\max}(A + U) + k\|U\|_\infty\}$; for any $U \in \mathcal{S}_p^+$ this gives an upper bound on $\mathrm{SDP}_k(\hat\Sigma)$. It is enough to look only at the minimum dual perturbation $\mathrm{MDP}_k(\hat\Sigma) = \min_{z \ge 0} \{\lambda_{\max}(\mathrm{ST}_z(\hat\Sigma)) + kz\}$, where $\mathrm{ST}_z$ denotes entrywise soft-thresholding at level $z$.
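The minimum dual perturbation only requires entrywise soft-thresholding and one top-eigenvalue computation per candidate $z$, so it is cheap to evaluate. A minimal sketch; the finite grid over $z$ is our own arbitrary discretization:

```python
import numpy as np

def soft_threshold(A, z):
    """Entrywise soft-thresholding of the symmetric matrix A at level z."""
    return np.sign(A) * np.maximum(np.abs(A) - z, 0.0)

def mdp_k(Sigma_hat, k, num_grid=200):
    """MDP_k(Sigma_hat) ~ min over z >= 0 of lambda_max(ST_z(Sigma_hat)) + k*z,
    approximated over a finite grid of z values."""
    grid = np.linspace(0.0, np.abs(Sigma_hat).max(), num_grid)
    return min(np.linalg.eigvalsh(soft_threshold(Sigma_hat, z))[-1] + k * z for z in grid)
```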

Upper bounds, with probability $1 - \delta$, for $\mathrm{DP} \in \{\mathrm{SDP}, \mathrm{MDP}\}$. Under the null hypothesis: $\mathrm{DP}_k(\hat\Sigma) \le 1 + 10\sqrt{\frac{k^2\log(ep/\delta)}{n}} =: \tau_0$. Under the alternative hypothesis: $\mathrm{DP}_k(\hat\Sigma) \ge 1 + \theta - 2(1+\theta)\sqrt{\frac{\log(1/\delta)}{n}} =: \tau_1$. Can detect as soon as $\tau_0 < \tau_1$, which yields $\theta \ge C\sqrt{\frac{k^2\log(p/k)}{n}}$.

[Figure: ratio of the 5% quantile under $H_1$ over the 95% quantile under $H_0$, versus signal strength $\theta$, for $\mathrm{SDP}_k$, $\mathrm{MDP}_k$ and $\lambda_{\max}(\hat\Sigma)$. When this ratio is larger than one, both type I and type II errors are below 5%. From A. d'Aspremont, Soutenance HDR, ENS Cachan, Nov. 2012.]

Summary. No detection for $\theta$ below $\sqrt{\frac{k}{n}\log\frac{p}{k}}$; detection with $\lambda_{\max}^k$ above $\sqrt{\frac{k}{n}\log\frac{p}{k}}$; detection with $\mathrm{DP}_k$ above $\sqrt{\frac{k^2}{n}\log\frac{p}{k}}$. Can we tighten the gap?

Numerical evidence. Fix the type I error at 1% and plot the type II error of $\mathrm{MDP}_k$ for $p \in \{50, 100, 200, 500\}$, $k = \sqrt{p}$. [Figure: type II error versus $\theta$ rescaled by $\sqrt{\frac{k}{n}\log\frac{p}{k}}$ (the minimax optimal scaling, left panel) and by $\sqrt{\frac{k^2}{n}\log\frac{p}{k}}$ (the proved scaling, right panel).]

Random graphs. A random (Erdős-Rényi) graph on $N$ vertices is obtained by drawing each edge independently at random with probability $1/2$. Here $N = 50$; the largest clique is of size $2\log N \approx 7.8$, asymptotically almost surely.

Hidden clique. We can hide a clique (here of size 10) in this graph: choose vertices arbitrarily and draw a clique on them,

then embed it in the original random graph.
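As an illustration (our own sketch; the sizes match the running example of $N = 50$ and a planted clique of size 10), the construction on the adjacency matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
N, clique_size = 50, 10

# Erdos-Renyi graph G(N, 1/2): each edge present independently with probability 1/2
A = np.triu(rng.random((N, N)) < 0.5, 1)
A = (A | A.T).astype(int)

# Plant a clique on an arbitrarily chosen set of vertices
clique = rng.choice(N, size=clique_size, replace=False)
A[np.ix_(clique, clique)] = 1
np.fill_diagonal(A, 0)   # no self-loops
```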

Hidden clique. Question: is there a hidden clique in this graph?

Hidden clique problem. It is believed to be hard to find, or to test for the presence of, a clique in a random graph (Alon, Arora, Feige, Hazan, Krauthgamer, ...; cryptosystems are based on this fact!). Conjecture: it is hard to find cliques of size between $2\log N$ and $\sqrt{N}$ (Alon, Krivelevich, Sudakov '98; Feige and Krauthgamer '00; Dekel et al. '10; Feige and Ron '10; Ames and Vavasis '11). A canonical example of average-case complexity.

Hidden clique problem. It seems related to our problem, but not trivially (the randomness structure is very fragile). Note that all our results extend to sub-Gaussian random variables. Theorem: if we could prove that there exist $C > 0$ and $\alpha \in (1, 2)$ such that under the null hypothesis $\mathrm{SDP}_k(\hat\Sigma) \le 1 + C\sqrt{\frac{k^\alpha \log(ep/\delta)}{n}}$, then this could be used to test for the presence of a clique of size $\mathrm{polylog}(n)\, n^{1/4}$.

Remarks. Unlike usual hardness results, this one applies to only one method (actually two), not to all methods. In progress: we can remove this limitation using bicliques (one needs to deal carefully with independence).

Conclusion. Optimal rates for sparse detection. Computationally efficient methods with a suboptimal rate. A first(?) link between sparse detection and average-case complexity. Opens the door to new statistical lower bounds: complexity-theoretic lower bounds. Evidence that heuristics cannot be optimal.