Optimality and Sub-optimality of Principal Component Analysis for Spiked Random Matrices


Alex Wein (MIT)
Joint work with: Amelia Perry (MIT), Afonso Bandeira (Courant, NYU), Ankur Moitra (MIT)

Random Matrices

Wigner matrix: $\frac{1}{\sqrt{n}} W \in \mathbb{R}^{n \times n}$ symmetric, $W_{ij} \sim \mathcal{N}(0, 1)$.

Wishart matrix: $\frac{1}{N} \sum_{k=1}^{N} y_k y_k^T$, with $y_k \sim \mathcal{N}(0, I_n)$.

[E. P. Wigner, AoM 1955; V. A. Marchenko, L. A. Pastur, Mat. Sb. 1967]

Spiked Models

Spiked Wigner matrix: $Y = \frac{1}{\sqrt{n}} W + \lambda x x^T$, with $\frac{1}{\sqrt{n}} W$ Wigner and $\|x\| = 1$. Visible in the largest eigenvalue when $\lambda > 1$.

Spiked Wishart matrix: $Y = \frac{1}{N} \sum_{k=1}^{N} y_k y_k^T$, with $y_k \sim \mathcal{N}(0, I_n + \beta x x^T)$, $\|x\| = 1$, $\beta \in [-1, \infty)$. Visible in the largest eigenvalue when $\beta > \sqrt{\gamma}$, where $\gamma = n/N$.

[J. Baik, G. Ben Arous, S. Péché, AoP 2005; D. Féral, S. Péché, CMP 2007]
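As a quick numerical illustration of this BBP transition (a minimal simulation sketch, not from the slides; all parameters are arbitrary), the top eigenvalue of a spiked Wigner matrix sticks to the bulk edge 2 for $\lambda \le 1$ and pops out to $\lambda + 1/\lambda$ for $\lambda > 1$:

```python
import numpy as np

# Sketch: spiked Wigner matrix Y = W/sqrt(n) + lam * x x^T.
# BBP transition: top eigenvalue ~ 2 for lam <= 1, ~ lam + 1/lam for lam > 1.
rng = np.random.default_rng(0)
n = 2000
for lam in [0.5, 1.5]:
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2)              # symmetric; off-diagonal entries ~ N(0,1)
    x = rng.standard_normal(n)
    x /= np.linalg.norm(x)                  # spherical unit spike
    Y = W / np.sqrt(n) + lam * np.outer(x, x)
    print(lam, np.linalg.eigvalsh(Y)[-1])   # ~2.0 at lam=0.5, ~2.17 at lam=1.5
```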

Statistical Questions

Detection: distinguish reliably (error probability $\to 0$): $Y \sim \frac{1}{\sqrt{n}} W$ (Wigner) vs $Y \sim \frac{1}{\sqrt{n}} W + \lambda x x^T$.

Recovery: achieve nontrivial correlation with the truth: $|\langle x, \hat{x} \rangle| > \epsilon$.

PCA solves both detection and recovery above the threshold $\lambda > 1$ (Wigner). Are they statistically possible below the threshold?

This requires a prior on the spike $x \in \mathbb{R}^n$: uniform on the unit sphere; i.i.d. $\pm 1/\sqrt{n}$ (Rademacher); sparse Rademacher.
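To make the recovery criterion concrete, here is a hypothetical sketch (arbitrary parameters) comparing the correlation between the top eigenvector of $Y$ and the planted spike, below and above the threshold:

```python
import numpy as np

# Sketch: PCA recovery |<x, x_hat>| above vs below the lambda = 1 threshold.
rng = np.random.default_rng(1)
n = 2000
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
for lam in [0.5, 1.5]:
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2)
    Y = W / np.sqrt(n) + lam * np.outer(x, x)
    x_hat = np.linalg.eigh(Y)[1][:, -1]       # top eigenvector of Y
    print(lam, abs(x_hat @ x))                # small at 0.5, ~sqrt(1 - 1/lam^2) at 1.5
```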

Detection vs Recovery

[Figure: hypothesis testing power and recovery quality as functions of $\lambda$, comparing "Optimal" against "PCA"; depending on the model, PCA is either optimal or sub-optimal.]

This talk: focus on the detection threshold (with hypothesis testing bounds and the recovery threshold along the way).

3 Scenarios

1. PCA achieves the optimal threshold (e.g. Wigner with spherical prior).
2. PCA can be beaten by an efficient algorithm (e.g. non-Gaussian Wigner).
3. PCA can be beaten, but only by an inefficient algorithm (e.g. sparse priors; Wishart).

Tool: Contiguity

A sequence of distributions $P_n$ is contiguous to $Q_n$ if, for any sequence of events $A_n$,
$$\lim_{n \to \infty} Q_n(A_n) = 0 \implies \lim_{n \to \infty} P_n(A_n) = 0.$$
[L. Le Cam, 1960]

If the distributions are contiguous, there is no reliable test to distinguish them. (Proof: let $A_n$ be the event that the distinguisher says $P_n$.)

Theorem (Second Moment). If $\mathbb{E}_{Q_n}\left[\left(\frac{dP_n}{dQ_n}\right)^2\right] = O(1)$, then $P_n$ is contiguous to $Q_n$.

E.g. if $P_n, Q_n$ have densities $p_n, q_n$: $\mathbb{E}_{Q_n}\left[\left(\frac{dP_n}{dQ_n}\right)^2\right] = \mathbb{E}_{Y \sim Q_n}\left[\left(\frac{p_n(Y)}{q_n(Y)}\right)^2\right]$.

Proof (Cauchy–Schwarz):
$$\underbrace{\int_{A_n} dP_n}_{P_n(A_n)} = \int_{A_n} \frac{dP_n}{dQ_n}\, dQ_n \le \Bigg(\underbrace{\int \left(\frac{dP_n}{dQ_n}\right)^2 dQ_n}_{\mathbb{E}_{Q_n}(dP_n/dQ_n)^2}\Bigg)^{1/2} \Bigg(\underbrace{\int_{A_n} dQ_n}_{Q_n(A_n)}\Bigg)^{1/2}.$$

Second Moment

Take $P_n$: $\frac{1}{\sqrt{n}} W + \lambda x x^T$ with $x \sim \mathcal{X}$, and $Q_n$: $\frac{1}{\sqrt{n}} W$. Then
$$\mathbb{E}_{Q_n}\left(\frac{dP_n}{dQ_n}\right)^2 = \mathbb{E}_{x, x' \sim \mathcal{X}} \exp\left(\frac{\lambda^2 n}{2} \langle x, x' \rangle^2\right).$$

E.g. if the prior $\mathcal{X}$ is uniform on the unit sphere (and $\lambda < 1$):
$$\lim_{n \to \infty} \mathbb{E}_{x, x' \sim \mathcal{X}} \exp\left(\frac{\lambda^2 n}{2} \langle x, x' \rangle^2\right) = \left(1 - \lambda^2\right)^{-1/2} < \infty.$$

So the distributions are contiguous below the spectral threshold. The same can be shown in the Wishart case. But what about when we know more about the spike?

[A. Montanari, D. Reichman, O. Zeitouni, NIPS 2015; A. Onatski, M. J. Moreira, M. Hallin, AoS 2013]
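The closed form $(1 - \lambda^2)^{-1/2}$ is easy to sanity-check by Monte Carlo (a sketch with arbitrary $n$ and $\lambda$; for $x, x'$ uniform on the sphere, $\sqrt{n}\langle x, x'\rangle$ is approximately standard Gaussian):

```python
import numpy as np

# Monte Carlo check: E exp(lam^2 n <x,x'>^2 / 2) ~ (1 - lam^2)^(-1/2) for lam < 1.
rng = np.random.default_rng(2)
n, trials, lam = 200, 50_000, 0.6
x = rng.standard_normal((trials, n))
x /= np.linalg.norm(x, axis=1, keepdims=True)
y = rng.standard_normal((trials, n))
y /= np.linalg.norm(y, axis=1, keepdims=True)
overlap = np.einsum('ij,ij->i', x, y)          # <x, x'> for each independent pair
est = np.exp(0.5 * lam**2 * n * overlap**2).mean()
print(est, (1 - lam**2) ** -0.5)               # both ~ 1.25
```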

Rademacher Spike Prior

$Y \sim \frac{1}{\sqrt{n}} W$ (Wigner) vs $Y \sim \frac{1}{\sqrt{n}} W + \lambda x x^T$, with $x \sim \mathrm{Unif}\left\{\pm \frac{1}{\sqrt{n}}\right\}^n$.

Still contiguous for $\lambda < 1$, and the contiguity argument goes through for a general class of priors! But for sparse priors, with enough sparsity, PCA is no longer optimal.

[J. Banks, C. Moore, R. Vershynin, J. Xu]

What about for Wishart?

$\frac{1}{N} \sum_{k=1}^{N} y_k y_k^T$, with $y_k \sim \mathcal{N}(0, I_n)$ vs $y_k \sim \mathcal{N}\left(0, I_n + \beta x x^T\right)$, $x \sim \mathrm{Unif}\left\{\pm \frac{1}{\sqrt{n}}\right\}^n$.

Same story for the spherical prior: contiguous for $|\beta| < \sqrt{\gamma}$ (the PCA threshold).

For the Rademacher prior, something surprising happens:
- If $\gamma = n/N < 1/3$, then the models are contiguous for $|\beta| < \sqrt{\gamma}$.
- But... for $\gamma$ above an explicit constant, there exists a computationally inefficient procedure that distinguishes the two models for some $\beta \in (-\sqrt{\gamma}, 0)$ (below the spectral threshold).

Is there a computational gap?
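For concreteness, here is a minimal sketch of the spiked Wishart model with a Rademacher spike (parameters arbitrary); the top sample-covariance eigenvalue separates from the Marchenko–Pastur bulk edge only when $\beta > \sqrt{\gamma}$:

```python
import numpy as np

# Sketch: spiked Wishart, rows y_k ~ N(0, I + beta * x x^T), Rademacher spike x.
rng = np.random.default_rng(3)
n, N = 300, 900                                   # gamma = n/N = 1/3
gamma = n / N
x = rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)  # x ~ Unif{+-1/sqrt(n)}^n
for beta in [0.3, 1.0]:                           # below vs above sqrt(gamma) ~ 0.577
    G = rng.standard_normal((N, n))
    a = np.sqrt(1 + beta) - 1                     # rows of G + a (Gx) x^T have cov I + beta x x^T
    Y = G + a * np.outer(G @ x, x)
    S = Y.T @ Y / N                               # sample covariance
    print(beta, np.linalg.eigvalsh(S)[-1], (1 + np.sqrt(gamma)) ** 2)
```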

Rademacher Wishart, Negative β

[Figure: phase diagram in the $(\gamma, \beta)$ plane. PCA succeeds above one line; the inefficient algorithm succeeds above a lower line; the contiguity lower bound shows detection is impossible below another line.]

Back to Wigner: What if the noise is not Gaussian?

$x \sim \mathrm{Unif}(S^{n-1})$, $W \in \mathbb{R}^{n \times n}$, $Y = \frac{1}{\sqrt{n}} W + \lambda x x^T$, but now $W_{ij} \sim p(w)$ with $\mathbb{E}\, w = 0$, $\mathbb{E}\, w^2 = 1$.

Universality: the spectral properties are unchanged...

[D. Féral, S. Péché, CMP 2007; T. Tao, V. Vu]

Can you tell which one is which?

For $W_{ij} \sim \mathrm{Unif}\{\pm 1\}$ and $\lambda < 1$: $W + \lambda \sqrt{n}\, x x^T$ vs $W$.

[Figure: heatmaps of the two matrices, visually indistinguishable.]

(With discrete $\pm 1$ noise the models are in fact easy to tell apart, since the spiked entries are no longer exactly $\pm 1$.) So let's restrict ourselves to the case where the density $p(w)$ is smooth.
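A hypothetical one-line check of why discrete noise is excluded: the rank-one spike shifts every entry off the lattice $\{-1, +1\}$, so the two models are trivially distinguishable entrywise:

```python
import numpy as np

# With Unif{+-1} noise, spiked entries are no longer exactly +-1.
rng = np.random.default_rng(4)
n, lam = 1000, 0.5
U = rng.choice([-1.0, 1.0], size=(n, n))
W = np.triu(U) + np.triu(U, 1).T                 # symmetric +-1 matrix
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
Y = W + lam * np.sqrt(n) * np.outer(x, x)
print(np.isin(W, [-1.0, 1.0]).mean())            # 1.0: unspiked entries are +-1
print(np.isin(Y, [-1.0, 1.0]).mean())            # ~0.0: spiked entries are not
```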

Entrywise function

If the noise is drawn from a non-Gaussian $p(w)$, we can beat PCA by applying a function $f: \mathbb{R} \to \mathbb{R}$ entrywise to our matrix $Y = W + \lambda \sqrt{n}\, x x^T$, followed by PCA:
$$f(Y_{ij}) = f\left(W_{ij} + \lambda \sqrt{n}\, x_i x_j\right)$$
$$\approx f(W_{ij}) + f'(W_{ij})\, \lambda \sqrt{n}\, x_i x_j$$
$$= f(W_{ij}) + \mathbb{E}[f'(W_{ij})]\, \lambda \sqrt{n}\, x_i x_j + \left(f'(W_{ij}) - \mathbb{E} f'(W_{ij})\right) \lambda \sqrt{n}\, x_i x_j$$
$$\approx f(W_{ij}) + \mathbb{E}[f'(W_{ij})]\, \lambda \sqrt{n}\, x_i x_j.$$

This is (close to) a new spiked Wigner matrix with $\lambda' = \lambda\, \mathbb{E} f'(W_{ij}) \big/ \sqrt{\mathbb{E} f^2(W_{ij})}$.

Calculus of variations gives the optimal choice of $f$: $f(w) = -\frac{p'(w)}{p(w)}$.
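A sketch of the whole pipeline, assuming unit-variance Laplace noise, for which $f(w) = -p'(w)/p(w) = \sqrt{2}\,\mathrm{sign}(w)$ (Laplace is not smooth at the origin, but it makes the optimal $f$ especially simple; all parameters below are arbitrary). At $\lambda$ between $1/\sqrt{F_p}$ and $1$, plain PCA fails while pre-transformed PCA recovers the spike:

```python
import numpy as np

rng = np.random.default_rng(5)
n, lam = 2000, 0.85                                 # below the plain-PCA threshold (lam = 1)
x = rng.standard_normal(n)
x /= np.linalg.norm(x)
L = rng.laplace(scale=1/np.sqrt(2), size=(n, n))    # unit-variance Laplace entries
W = np.triu(L) + np.triu(L, 1).T                    # symmetrize
Y = W / np.sqrt(n) + lam * np.outer(x, x)

def top_eigvec(M):
    return np.linalg.eigh(M)[1][:, -1]

f = lambda w: np.sqrt(2) * np.sign(w)               # f = -p'/p for the Laplace density
Y_pre = f(np.sqrt(n) * Y) / np.sqrt(n)              # apply f entrywise to W + lam*sqrt(n)*x x^T
print("plain PCA:          ", abs(top_eigvec(Y) @ x))       # small
print("pre-transformed PCA:", abs(top_eigvec(Y_pre) @ x))   # order-1 overlap
```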

Pre-transformed PCA

[Figure: dashed: $p(w)$; solid: $f(w) = -p'(w)/p(w)$. The transformation $Y \mapsto f(Y)$.]

New threshold at $\lambda = \frac{1}{\sqrt{F_p}}$, where $F_p = \mathbb{E}_p\left[\left(\frac{p'(w)}{p(w)}\right)^2\right] \ge 1$ is the Fisher information. So Gaussian noise is hardest. Contiguity shows this is optimal!

[T. Lesieur, F. Krzakala, L. Zdeborová, Allerton 2015; F. Krzakala, J. Xu, L. Zdeborová, 2016]
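The new threshold $1/\sqrt{F_p}$ is easy to evaluate numerically for a given density. A sketch using a unit-variance two-component Gaussian mixture as a stand-in (the mixture parameters are made up):

```python
import numpy as np
from scipy.integrate import quad

# Fisher information F_p = E_p[(p'(w)/p(w))^2] for a unit-variance Gaussian mixture.
s = 0.4                               # component std; means +/-mu chosen so Var = 1
mu = np.sqrt(1 - s**2)
z = s * np.sqrt(2 * np.pi)
p  = lambda w: 0.5 * (np.exp(-(w - mu)**2 / (2*s**2)) + np.exp(-(w + mu)**2 / (2*s**2))) / z
dp = lambda w: 0.5 * (-(w - mu) * np.exp(-(w - mu)**2 / (2*s**2))
                      - (w + mu) * np.exp(-(w + mu)**2 / (2*s**2))) / (z * s**2)
F_p, _ = quad(lambda w: dp(w)**2 / p(w), -12, 12)
print("F_p =", F_p, " new threshold 1/sqrt(F_p) =", 1 / np.sqrt(F_p))  # < 1 since F_p > 1
```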

Proof Details: Bounding the (Wigner) Second Moment

$$\mathbb{E}_{Q_n}\left(\frac{dP_n}{dQ_n}\right)^2 = \mathbb{E}_{x, x' \sim \mathcal{X}} \exp\left(\frac{\lambda^2 n}{2} \langle x, x' \rangle^2\right)$$
$$= \int_0^\infty \Pr\left[\exp\left(\frac{\lambda^2 n}{2} \langle x, x' \rangle^2\right) \ge t\right] dt$$
$$= \int_0^\infty \Pr\left[|\langle x, x' \rangle| \ge \sqrt{\frac{2 \log t}{\lambda^2 n}}\right] dt$$
$$= \int_0^1 \Pr\left[|\langle x, x' \rangle| \ge u\right] \lambda^2 n\, u \exp\left(\frac{\lambda^2 n}{2} u^2\right) du \qquad \text{(substituting } t = e^{\lambda^2 n u^2/2}\text{)}$$
$$\approx \int_0^1 \lambda^2 n\, u \exp\left(n\left(\frac{\lambda^2}{2} u^2 - R(u)\right)\right) du.$$

Rate function: $R(u) = \lim_{n \to \infty} -\frac{1}{n} \log \Pr\left[|\langle x, x' \rangle| \ge u\right]$. In other words: $\Pr\left[|\langle x, x' \rangle| \ge u\right] \approx \exp(-n\, R(u))$.
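The rate-function approximation can be checked directly against exact binomial tails in the Rademacher case (a sketch; $n$ and $u$ are arbitrary): for $x, x'$ i.i.d. $\mathrm{Unif}\{\pm 1/\sqrt{n}\}^n$, $n\langle x, x'\rangle$ is a sum of $n$ independent signs, and $R$ is the Rademacher formula from the next slide.

```python
import numpy as np
from scipy.stats import binom

# Check Pr[|<x,x'>| >= u] ~ exp(-n R(u)) for the Rademacher prior.
n, u = 400, 0.3
# n<x,x'> = 2K - n with K ~ Binomial(n, 1/2), so |<x,x'>| >= u iff K >= n(1+u)/2 (or the mirror).
k = int(np.ceil(n * (1 + u) / 2))
tail = 2 * binom.sf(k - 1, n, 0.5)                 # Pr[K >= k] plus the symmetric lower tail
h = lambda p: -p*np.log(p) - (1-p)*np.log(1-p)     # binary entropy (nats)
R = np.log(2) - h((1 + u) / 2)                     # Rademacher rate function
print(-np.log(tail) / n, R)                        # close for large n
```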

Proof Details: Bounding the (Wigner) Second Moment (continued)

$$\mathbb{E}_{Q_n}\left(\frac{dP_n}{dQ_n}\right)^2 \lesssim \int_0^1 \lambda^2 n\, u \exp\left(n\left(\frac{\lambda^2}{2} u^2 - R(u)\right)\right) du.$$

This is bounded iff $\frac{\lambda^2}{2} u^2 - R(u) < 0$ for all $u \in (0, 1)$. In other words: how big a parabola can you fit underneath the rate function?

E.g. the Rademacher prior ($\pm 1/\sqrt{n}$) has $R(u) = \log 2 - h\left(\frac{1+u}{2}\right)$, where $h(p) = -p \log p - (1-p) \log(1-p)$.

[Figure: $R(u)$ versus the parabola $\frac{1}{2} u^2$, i.e. $\lambda = 1$.]
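The parabola condition itself can be verified numerically for the Rademacher rate function (a quick sketch): $\frac{\lambda^2}{2} u^2 \le R(u)$ holds on $(0, 1)$ all the way up to $\lambda = 1$, consistent with PCA being optimal for this prior.

```python
import numpy as np

h = lambda p: -p*np.log(p) - (1-p)*np.log(1-p)   # binary entropy (nats)
u = np.linspace(1e-3, 1 - 1e-3, 100_000)
R = np.log(2) - h((1 + u) / 2)                   # Rademacher rate function
lam = 1.0
print(np.max(lam**2 * u**2 / 2 - R))             # <= 0 (up to float error): parabola fits under R
```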

Improvement: Conditioning

Goal: $P_n$ contiguous to $Q_n$: $Q_n(A_n) \to 0 \implies P_n(A_n) \to 0$.

It is sufficient to show that $\tilde{P}_n$ is contiguous to $Q_n$, where $P_n$ and $\tilde{P}_n$ agree with probability $1 - o(1)$. Choose $\tilde{P}_n$ to make $\mathbb{E}_{Q_n}\left(\frac{d\tilde{P}_n}{dQ_n}\right)^2$ small: condition away from rare bad events.

[J. Banks, C. Moore, J. Neeman, P. Netrapalli, COLT 2016]

Conditioning Example: Sparse Rademacher Prior

Sparsity $\rho \in [0, 1]$. Prior $\mathcal{X}_\rho$: i.i.d.
$$x_i \sim \frac{1}{\sqrt{\rho n}} \times \begin{cases} +1 & \text{w.p. } \rho/2 \\ -1 & \text{w.p. } \rho/2 \\ 0 & \text{w.p. } 1 - \rho. \end{cases}$$

$P_n$: spiked Wigner with prior $\mathcal{X}_\rho$; $Q_n$: unspiked. $\tilde{P}_n$: change the prior to $\tilde{\mathcal{X}}_\rho$, conditioning on a close-to-typical proportion of nonzeros (a minimal sampler is sketched below).

[Figure: conditioned vs. unconditioned rate function $R(u)$, for $\rho = 0.2$.]

[J. Banks, C. Moore, R. Vershynin, J. Xu]
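A minimal sketch of the sparse Rademacher prior (the conditioning step is only indicated in a comment; parameters arbitrary):

```python
import numpy as np

def sparse_rademacher(n, rho, rng):
    """i.i.d. entries: +-1/sqrt(rho*n) w.p. rho/2 each, 0 w.p. 1 - rho."""
    signs = rng.choice([-1.0, 1.0], size=n)
    support = rng.random(n) < rho
    return signs * support / np.sqrt(rho * n)

rng = np.random.default_rng(6)
x = sparse_rademacher(10_000, 0.2, rng)
# ||x|| ~ 1; the conditioned prior X~_rho would resample until the number of
# nonzeros is close to its typical value rho*n, keeping ||x||^2 = 1 + o(1).
print(np.count_nonzero(x), np.linalg.norm(x))
```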

Sparse Rademacher: Results

[Figure: detection threshold $\lambda$ as a function of sparsity $\rho$, comparing the unconditioned bound, the conditioned bound, a noise-conditioned bound (upcoming), and the replica prediction (the truth).]

Summary

3 scenarios:
- PCA optimal (Wigner with spherical or Rademacher prior)
- PCA beaten efficiently (non-Gaussian Wigner)
- PCA beaten inefficiently (Rademacher Wishart; sparse Rademacher Wigner)

Second moment method: a simple, widely applicable technique for proving non-detection lower bounds.

Is it possible to match the replica prediction with a simple method?

Thanks! Questions?
