18.S096: Principal Component Analysis in High Dimensions and the Spike Model


18.S096: Principal Component Analysis in High Dimensions and the Spike Model
Topics in Mathematics of Data Science (Fall 2015)
Afonso S. Bandeira
bandeira@mit.edu
September 18, 2015

These are lecture notes not in final form and will be continuously edited and/or corrected (as I am sure they contain many typos). Please let me know if you find any typo/mistake. Also, I am posting shorter versions of these notes (together with the open problems) on my Blog, see [Ban15b].

0.1 Brief Review of some linear algebra tools

In this Section we'll briefly review a few linear algebra tools that will be important during the course. If you need a refresher on any of these concepts, I recommend taking a look at [HJ85] and/or [Gol96].

Singular Value Decomposition

The Singular Value Decomposition (SVD) is one of the most useful tools for this course! Given a matrix M ∈ R^{m×n}, the SVD of M is given by

    M = U Σ V^T,    (1)

where U ∈ O(m) and V ∈ O(n) are orthogonal matrices (meaning that U^T U = U U^T = I and V^T V = V V^T = I) and Σ ∈ R^{m×n} is a matrix with non-negative entries on its diagonal and zero entries otherwise. The columns of U and V are referred to, respectively, as the left and right singular vectors of M, and the diagonal elements of Σ as the singular values of M.

Remark 0.1 Say m ≥ n; it is easy to see that we can also think of the SVD as having U ∈ R^{m×n} with U^T U = I, Σ ∈ R^{n×n} a diagonal matrix with non-negative entries, and V ∈ O(n).

Spectral Decomposition

If M ∈ R^{n×n} is symmetric then it admits a spectral decomposition

    M = V Λ V^T,

where V ∈ O(n) is a matrix whose columns v_k are the eigenvectors of M and Λ is a diagonal matrix whose diagonal elements λ_k are the eigenvalues of M. Similarly, we can write

    M = Σ_{k=1}^n λ_k v_k v_k^T.

When all of the eigenvalues of M are non-negative we say that M is positive semidefinite and write M ⪰ 0. In that case we can write

    M = (V Λ^{1/2}) (V Λ^{1/2})^T.

A decomposition of M of the form M = U U^T (such as the one above) is called a Cholesky decomposition.

The spectral norm of M is defined as

    ‖M‖ = max_k |λ_k(M)|.

Trace and norm

Given a matrix M ∈ R^{n×n}, its trace is given by

    Tr(M) = Σ_{k=1}^n M_{kk} = Σ_{k=1}^n λ_k(M).

Its Frobenius norm is given by

    ‖M‖_F = sqrt( Σ_{i,j} M_{ij}^2 ) = sqrt( Tr(M^T M) ).

A particularly important property of the trace is that

    Tr(AB) = Σ_{i,j=1}^n A_{ij} B_{ji} = Tr(BA).

Note that while this implies, e.g., Tr(ABC) = Tr(CAB), it does not imply, e.g., Tr(ABC) = Tr(ACB), which is not true in general!

0.2 Quadratic Forms

During the course we will be interested in solving problems of the type

    max_{V ∈ R^{n×d}, V^T V = I_{d×d}} Tr( V^T M V ),

where M is a symmetric matrix.
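These trace identities are easy to sanity-check numerically. Below is a minimal pure-Python sketch; the 2×2 matrices are arbitrary hand-picked examples, chosen only for illustration:

```python
# Check the trace identities above: Tr(AB) = Tr(BA), hence Tr(ABC) = Tr(CAB),
# while Tr(ABC) = Tr(ACB) fails in general.

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def tr(A):
    """Trace: sum of diagonal entries."""
    return sum(A[i][i] for i in range(len(A)))

A = [[1, 2], [3, 4]]
B = [[0, 1], [1, 0]]
C = [[2, 0], [0, 3]]

t_ab, t_ba = tr(matmul(A, B)), tr(matmul(B, A))   # cyclic shift: equal
t_abc = tr(matmul(matmul(A, B), C))               # Tr(ABC)
t_cab = tr(matmul(matmul(C, A), B))               # Tr(CAB): equal to Tr(ABC)
t_acb = tr(matmul(matmul(A, C), B))               # Tr(ACB): differs in general
print(t_ab, t_ba, t_abc, t_cab, t_acb)            # → 5 5 13 13 12
```

Any triple of matrices for which ACB is not a cyclic shift of ABC will typically exhibit the same failure.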

Note that this is equivalent to

    max_{v_1,...,v_d ∈ R^n, v_i^T v_j = δ_{ij}} Σ_{k=1}^d v_k^T M v_k,    (2)

where δ_{ij} is the Kronecker delta (it is 1 if i = j and 0 otherwise). When d = 1 this reduces to the more familiar

    max_{v ∈ R^n, ‖v‖_2 = 1} v^T M v.    (3)

It is easy to see (for example, using the spectral decomposition of M) that (3) is maximized by the leading eigenvector of M and that

    max_{v ∈ R^n, ‖v‖_2 = 1} v^T M v = λ_max(M).

It is also not very difficult to see (it follows, for example, from a Theorem of Fan; see page 3 of [Mos11]) that (2) is maximized by taking v_1,...,v_d to be the d leading eigenvectors of M, and that its value is simply the sum of the d largest eigenvalues of M. The nice consequence of this is that the solution to (2) can be computed sequentially: we can first solve for d = 1, computing v_1, then v_2, and so on.

Remark 0.2 All of the tools and results above have natural analogues when the matrices have complex entries (and are Hermitian instead of symmetric).

1.1 Dimension Reduction and PCA

When faced with a high dimensional dataset, a natural approach is to try to reduce its dimension, either by projecting it to a lower dimensional space or by finding a better representation for the data. During this course we will see a few different ways of doing dimension reduction.

We will start with Principal Component Analysis (PCA). In fact, PCA continues to be one of the best (and simplest) tools for exploratory data analysis. Remarkably, it dates back to a 1901 paper by Karl Pearson [Pea01]!

Let's say we have n data points x_1,...,x_n ∈ R^p, for some p, and we are interested in (linearly) projecting the data to d < p dimensions. This is particularly useful if, say, one wants to visualize the data in two or three dimensions. There are a couple of different ways we can try to choose this projection:

1. Finding the d-dimensional affine subspace for which the projections of x_1,...,x_n on it best approximate the original points x_1,...,x_n.

2. Finding the d-dimensional projection of x_1,...,x_n that preserves as much variance of the data as possible.
As we will see below, these two approaches are equivalent and they correspond to Principal Component Analysis.
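The variational facts above (the unit-norm maximizer of v^T M v is the leading eigenvector, with optimal value λ_max(M)) can be checked numerically. A minimal power-iteration sketch in pure Python; the symmetric matrix M (eigenvalues 3 and 1) and the iteration count are illustrative choices:

```python
import math

# Power iteration: repeatedly apply M and renormalize; the iterates converge
# to the leading eigenvector, and the Rayleigh quotient to lambda_max(M).
M = [[2.0, 1.0], [1.0, 2.0]]   # eigenvalues 3 and 1, leading eigenvector (1,1)/sqrt(2)

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def normalize(x):
    n = math.sqrt(sum(t * t for t in x))
    return [t / n for t in x]

v = [1.0, 0.0]                  # arbitrary starting vector with nonzero overlap
for _ in range(100):
    v = normalize(matvec(M, v))

w = matvec(M, v)
lam = sum(v[i] * w[i] for i in range(2))   # Rayleigh quotient v^T M v
print(lam)                                 # ≈ 3.0
```

Running for d > 1 amounts to repeating this with deflation (subtracting λ_1 v_1 v_1^T), matching the sequential computation described above.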

Before proceeding, we recall a couple of simple statistical quantities associated with x_1,...,x_n that will reappear below. Given x_1,...,x_n, we define the sample mean as

    µ_n = (1/n) Σ_{k=1}^n x_k,    (4)

and the sample covariance as

    Σ_n = (1/(n−1)) Σ_{k=1}^n (x_k − µ_n)(x_k − µ_n)^T.    (5)

Remark 1.3 If x_1,...,x_n are independently sampled from a distribution, then µ_n and Σ_n are unbiased estimators for, respectively, the mean and covariance of the distribution.

We will start with the first interpretation of PCA and then show that it is equivalent to the second.

PCA as best d-dimensional affine fit

We are trying to approximate each x_k by

    x_k ≈ µ + Σ_{i=1}^d (β_k)_i v_i,    (6)

where v_1,...,v_d is an orthonormal basis for the d-dimensional subspace, µ ∈ R^p represents the translation, and β_k corresponds to the coefficients of x_k. If we represent the subspace by V = [v_1 ⋯ v_d] ∈ R^{p×d}, then we can rewrite (6) as

    x_k ≈ µ + V β_k,    (7)

where V^T V = I_{d×d}, as the vectors v_i are orthonormal.

We will measure goodness of fit in terms of least squares and attempt to solve

    min_{µ, V, β_k : V^T V = I} Σ_{k=1}^n ‖x_k − (µ + V β_k)‖_2^2.    (8)

We start by optimizing for µ. It is easy to see that the first order conditions for µ correspond to

    ∇_µ Σ_{k=1}^n ‖x_k − (µ + V β_k)‖_2^2 = 0  ⇔  Σ_{k=1}^n ( x_k − (µ + V β_k) ) = 0.

Thus, the optimal value µ* of µ satisfies

    Σ_{k=1}^n (x_k − µ*) − V ( Σ_{k=1}^n β_k ) = 0.

Because we may take Σ_{k=1}^n β_k = 0 (any nonzero sum can be absorbed into µ), the optimal µ is given by

    µ* = (1/n) Σ_{k=1}^n x_k = µ_n,

the sample mean. We can then proceed to find the solution of (8) by solving

    min_{V, β_k : V^T V = I} Σ_{k=1}^n ‖x_k − µ_n − V β_k‖_2^2.    (9)

Let us proceed by optimizing for β_k. Since the problem decouples for each k, we can focus on, for each k,

    min_{β_k} ‖x_k − µ_n − V β_k‖_2^2 = min_{β_k} ‖ x_k − µ_n − Σ_{i=1}^d (β_k)_i v_i ‖_2^2.    (10)

Since v_1,...,v_d are orthonormal, it is easy to see that the solution is given by (β_k)_i = v_i^T (x_k − µ_n), which can be succinctly written as β_k = V^T (x_k − µ_n). Thus, (9) is equivalent to

    min_{V^T V = I} Σ_{k=1}^n ‖ (x_k − µ_n) − V V^T (x_k − µ_n) ‖_2^2.    (11)

Note that

    ‖(x_k − µ_n) − V V^T (x_k − µ_n)‖_2^2
      = (x_k − µ_n)^T (x_k − µ_n) − 2 (x_k − µ_n)^T V V^T (x_k − µ_n) + (x_k − µ_n)^T V (V^T V) V^T (x_k − µ_n)
      = (x_k − µ_n)^T (x_k − µ_n) − (x_k − µ_n)^T V V^T (x_k − µ_n).

Since (x_k − µ_n)^T (x_k − µ_n) does not depend on V, minimizing (11) is equivalent to

    max_{V^T V = I} Σ_{k=1}^n (x_k − µ_n)^T V V^T (x_k − µ_n).    (12)

A few more simple algebraic manipulations using properties of the trace:

    Σ_{k=1}^n (x_k − µ_n)^T V V^T (x_k − µ_n)
      = Σ_{k=1}^n Tr[ (x_k − µ_n)^T V V^T (x_k − µ_n) ]
      = Σ_{k=1}^n Tr[ V^T (x_k − µ_n)(x_k − µ_n)^T V ]
      = Tr[ V^T ( Σ_{k=1}^n (x_k − µ_n)(x_k − µ_n)^T ) V ]
      = (n−1) Tr[ V^T Σ_n V ].

This means that the solution to (12) is given by the solution to

    max_{V ∈ R^{p×d}, V^T V = I} Tr[ V^T Σ_n V ].    (13)

As we saw above (recall (2)), the solution is given by V = [v_1, ..., v_d], where v_1,...,v_d correspond to the d leading eigenvectors of Σ_n.

Let us now show that interpretation (2), finding the d-dimensional projection of x_1,...,x_n that preserves the most variance, also arrives at the optimization problem (13).

PCA as the d-dimensional projection that preserves the most variance

We aim to find an orthonormal basis v_1,...,v_d (organized as V = [v_1,...,v_d] with V^T V = I_{d×d}) of a d-dimensional space such that the projection of x_1,...,x_n on this subspace has the most variance. Equivalently, we can ask for the points

    V^T x_1, ..., V^T x_n ∈ R^d

to have as much variance as possible. Hence, we are interested in solving

    max_{V^T V = I} Σ_{k=1}^n ‖ V^T x_k − (1/n) Σ_{r=1}^n V^T x_r ‖^2.    (14)

Note that

    Σ_{k=1}^n ‖ V^T (x_k − µ_n) ‖^2 = (n−1) Tr( V^T Σ_n V ),

showing that (14) is equivalent to (13) and that the two interpretations of PCA are indeed equivalent.

Finding the Principal Components

When given a dataset x_1,...,x_n ∈ R^p, in order to compute the Principal Components one needs to find the leading eigenvectors of

    Σ_n = (1/(n−1)) Σ_{k=1}^n (x_k − µ_n)(x_k − µ_n)^T.

A naive way of doing this would be to construct Σ_n (which takes O(n p^2) work) and then find its spectral decomposition (which takes O(p^3) work). This means that the computational complexity of this procedure is O(max{n p^2, p^3}) (see [HJ85] and/or [Gol96]).

An alternative is to use the Singular Value Decomposition (1). Let X = [x_1 ⋯ x_n]; recall that

    Σ_n = (1/(n−1)) ( X − µ_n 1^T ) ( X − µ_n 1^T )^T.

Let us take the SVD of X − µ_n 1^T = U_L D U_R^T, with U_L ∈ O(p), D diagonal, and U_R^T U_R = I. Then,

    Σ_n = (1/(n−1)) ( X − µ_n 1^T ) ( X − µ_n 1^T )^T = (1/(n−1)) U_L D U_R^T U_R D U_L^T = (1/(n−1)) U_L D^2 U_L^T,

meaning that the columns of U_L correspond to the eigenvectors of Σ_n. Computing the SVD of X − µ_n 1^T takes O(min{n^2 p, n p^2}), but if one is interested in simply computing the top d eigenvectors then this computational cost reduces to O(d n p). This can be further improved with randomized algorithms: there are randomized algorithms that compute an approximate solution in O( n p log d + (n + p) d^2 ) time (see for example [HM09, RST09, MM15]).¹

Which d should we pick?

Given a dataset, if the objective is to visualize it then picking d = 2 or d = 3 might make the most sense. However, PCA is useful for many other purposes. For example: (1) often the data belongs to a lower dimensional space but is corrupted by high dimensional noise; when using PCA it is oftentimes possible to reduce the noise while keeping the signal. (2) One may be interested in running an algorithm that would be too computationally expensive to run in high dimensions; dimension reduction may help there. In these applications (and many others) it is not clear how to pick d.

If we denote the k-th largest eigenvalue of Σ_n by λ_k(Σ_n), then the k-th principal component accounts for a λ_k(Σ_n) / Tr(Σ_n) proportion of the variance.²

A fairly popular heuristic is to try to choose the cut-off at a component that has significantly more variance than the one immediately after. This is usually visualized by a scree plot: a plot of the values of the ordered eigenvalues. Here is an example:

    [Figure: a scree plot, showing the ordered eigenvalues of Σ_n against their index k.]

¹ If there is time, we might discuss some of these methods later in the course.
² Note that Tr(Σ_n) = Σ_{k=1}^p λ_k(Σ_n).
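The whole pipeline (compute µ_n and Σ_n via (4) and (5), extract a leading eigenvector, read off the proportion of explained variance) can be sketched in pure Python. The tiny dataset below is a hand-made illustration, and power iteration stands in for a proper eigensolver or the SVD route described above:

```python
import math

# Tiny illustrative dataset in R^2 (n = 4 points), mostly spread along (1, 1).
xs = [[2.0, 1.9], [-2.0, -2.1], [1.0, 1.1], [-1.0, -0.9]]
n, p = len(xs), len(xs[0])

# Sample mean (4) and sample covariance (5).
mu = [sum(x[j] for x in xs) / n for j in range(p)]
sigma = [[sum((x[i] - mu[i]) * (x[j] - mu[j]) for x in xs) / (n - 1)
          for j in range(p)] for i in range(p)]

# Leading principal component by power iteration (a simple stand-in for
# an eigensolver or the SVD of the centered data matrix).
v = [1.0] + [0.0] * (p - 1)
for _ in range(200):
    w = [sum(sigma[i][j] * v[j] for j in range(p)) for i in range(p)]
    norm = math.sqrt(sum(t * t for t in w))
    v = [t / norm for t in w]

lam1 = sum(v[i] * sum(sigma[i][j] * v[j] for j in range(p)) for i in range(p))
explained = lam1 / sum(sigma[k][k] for k in range(p))   # lambda_1 / Tr(sigma)
print(mu, explained)   # explained close to 1: the data is essentially 1-dimensional
```

Here a scree plot would show one dominant eigenvalue, so the elbow heuristic below would suggest d = 1.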

It is common to then try to identify an "elbow" on the scree plot to choose the cut-off. In the next Section we will look into random matrix theory to better understand the behavior of the eigenvalues of Σ_n, and it will help us understand when to cut off.

A related open problem

We now describe an interesting open problem posed by Mallat and Zeitouni in [MZ11].

Open Problem 1.1 (Mallat and Zeitouni [MZ11]) Let g ∼ N(0, Σ) be a gaussian random vector in R^p with a known covariance matrix Σ, and let d < p. Now, for any orthonormal basis V = [v_1,...,v_p] of R^p, consider the following random variable Γ_V: given a draw of the random vector g, Γ_V is the squared ℓ_2 norm of the largest projection of g on a subspace generated by d elements of the basis V. The question is: for which basis V is E[Γ_V] maximized?

The conjecture in [MZ11] is that the optimal basis is the eigendecomposition of Σ. It is known that this is the case for d = 1 (see [MZ11]), but the question remains open for d > 1. It is not very difficult to see that one can assume, without loss of generality, that Σ is diagonal.

A particularly intuitive way of stating the problem is:

1. Given Σ ∈ R^{p×p} and d.
2. Pick an orthonormal basis v_1,...,v_p.
3. Given a draw of g ∼ N(0, Σ).
4. Pick d elements ṽ_1,...,ṽ_d of the basis.
5. Score: Σ_{i=1}^d ( ṽ_i^T g )^2.

The objective is to pick the basis in order to maximize the expected value of the Score.

Notice that if the steps of the procedure were taken in a slightly different order, in which step 4 took place before having access to the draw of g (step 3), then the best basis would indeed be the eigenbasis of Σ, and the best subset of the basis would simply be the d leading eigenvectors (notice the resemblance with PCA, as described above).

More formally, we can write the problem as finding

    argmax_{V ∈ R^{p×p}, V^T V = I} E[ max_{S ⊂ [p], |S| = d} Σ_{i ∈ S} ( v_i^T g )^2 ],

where g ∼ N(0, Σ). The observation regarding the different ordering of the steps amounts to saying that the eigenbasis of Σ is the optimal solution for

    argmax_{V ∈ R^{p×p}, V^T V = I} max_{S ⊂ [p], |S| = d} E[ Σ_{i ∈ S} ( v_i^T g )^2 ].
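The procedure is easy to simulate. The sketch below is a seeded Monte Carlo illustration with hand-picked assumptions: a diagonal Σ in R^3, d = 1 (the case where the conjecture is known to hold), and a 45-degree rotation of the first two coordinates as the competing basis:

```python
import math, random

random.seed(0)
p, trials = 3, 10000
var = [9.0, 1.0, 1.0]          # diagonal Sigma; its eigenbasis is the canonical basis

c = math.sqrt(0.5)             # 45-degree rotation mixing the first two coordinates
eig_basis = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
rot_basis = [[c, c, 0], [c, -c, 0], [0, 0, 1]]

def score(basis, g):
    # Best squared projection of g on d = 1 elements of the basis (step 5 above).
    return max(sum(v[i] * g[i] for i in range(p)) ** 2 for v in basis)

s_eig = s_rot = 0.0
for _ in range(trials):
    g = [math.sqrt(var[i]) * random.gauss(0, 1) for i in range(p)]
    s_eig += score(eig_basis, g) / trials
    s_rot += score(rot_basis, g) / trials
print(s_eig, s_rot)   # the eigenbasis should score higher, as known for d = 1
```

Repeating this with |S| = d > 1 subsets gives a way to probe the open case experimentally, though simulation of course proves nothing about the conjecture.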

1.2 PCA in high dimensions and Marchenko-Pastur

Let us assume that the data points x_1,...,x_n ∈ R^p are independent draws of a gaussian random variable g ∼ N(0, Σ) for some covariance Σ ∈ R^{p×p}. In this case, when we use PCA we are hoping to find low dimensional structure in the distribution, which should correspond to large eigenvalues of Σ (and their corresponding eigenvectors). For this reason (and since PCA depends on the spectral properties of Σ_n) we would like to understand whether the spectral properties of Σ_n (eigenvalues and eigenvectors) are close to the ones of Σ.

Since E[Σ_n] = Σ, if p is fixed and n → ∞ the law of large numbers guarantees that indeed Σ_n → Σ. However, in many modern applications it is not uncommon to have p on the order of n (or, sometimes, even larger!). For example, if our dataset is composed of images, then n is the number of images and p the number of pixels per image; it is conceivable that the number of pixels is on the order of the number of images in a set. Unfortunately, in that case, it is no longer clear that Σ_n → Σ. Dealing with this type of difficulty is the realm of high dimensional statistics.

For simplicity we will instead try to understand the spectral properties of

    S_n = (1/n) X X^T.

Since x ∼ N(0, Σ), we know that µ_n → 0 (and, clearly, n/(n−1) → 1), so the spectral properties of S_n will be essentially the same as those of Σ_n.³

Let us start by looking into a simple example, Σ = I. In that case the distribution has no low dimensional structure, as the distribution is rotation invariant. The following is a histogram (left) and a scree plot (right) of the eigenvalues of a sample of S_n (with Σ = I) for p = 500 and n = 1000. The red line is the eigenvalue distribution predicted by the Marchenko-Pastur distribution (15), which we will discuss below.

    [Figure: histogram (left) and scree plot (right) of the eigenvalues of a sample of S_n with Σ = I, p = 500, n = 1000; the Marchenko-Pastur density is overlaid in red.]

³ In this case, S_n is actually the Maximum Likelihood Estimator for Σ; we'll talk about Maximum Likelihood Estimation later in the course.

As one can see in the image, there are many eigenvalues considerably larger than 1 (and some considerably larger than others). Notice that, if given this profile of eigenvalues of Σ_n, one could potentially be led to believe that the data has low dimensional structure, when in truth the distribution it was drawn from is isotropic.

Understanding the distribution of eigenvalues of random matrices is at the core of Random Matrix Theory (there are many good books on Random Matrix Theory, e.g. [Tao12] and [AGZ10]). This particular limiting distribution was first established in 1967 by Marchenko and Pastur [MP67] and is now referred to as the Marchenko-Pastur distribution. They showed that, if p and n both go to ∞ with their ratio fixed, p/n = γ ≤ 1, then the sample distribution of the eigenvalues of S_n (like the histogram above) will, in the limit, be

    dF_γ(λ) = (1 / (2π γ λ)) sqrt( (γ_+ − λ)(λ − γ_-) ) 1_{[γ_-, γ_+]}(λ) dλ,    (15)

with support [γ_-, γ_+], where γ_± = (1 ± sqrt(γ))^2. This is plotted as the red line in the figure above.

Remark 1.4 We will not show the proof of the Marchenko-Pastur Theorem here (you can see, for example, [Bai99] for several different proofs of it), but one approach to a proof uses the so-called moment method. The core of the idea is to note that one can compute the moments of the eigenvalue distribution in two ways, noting that (in the limit) for any k,

    (1/p) E Tr[ ((1/n) X X^T)^k ] = (1/p) E Tr( S_n^k ) = E (1/p) Σ_{i=1}^p λ_i^k(S_n) → ∫_{γ_-}^{γ_+} λ^k dF_γ(λ),

and that the quantities (1/p) E Tr[ ((1/n) X X^T)^k ] can be estimated (these estimates rely essentially on combinatorics). The distribution dF_γ(λ) can then be computed from its moments.

A related open problem

Open Problem 1.2 (Monotonicity of singular values [BKS13]) Consider the setting above but with p = n; then X ∈ R^{n×n} is a matrix with i.i.d. N(0,1) entries. Let σ_i( (1/√n) X ) denote the i-th singular value⁴ of (1/√n) X, and define

    α_R(n) := E[ (1/n) Σ_{i=1}^n σ_i( (1/√n) X ) ],

the expected value of the average singular value of (1/√n) X. The conjecture is that, for every n ≥ 1,

    α_R(n+1) ≥ α_R(n).

⁴ The i-th diagonal element of Σ in the SVD (1/√n) X = U Σ V^T.

Moreover, for the analogous quantity α_C(n) defined over the complex numbers, meaning simply that each entry of X is an i.i.d. complex valued standard gaussian CN(0,1), the reverse inequality is conjectured for all n ≥ 1:

    α_C(n+1) ≤ α_C(n).

Notice that the singular values of (1/√n) X are simply the square roots of the eigenvalues of S_n:

    σ_i( (1/√n) X ) = sqrt( λ_i(S_n) ).

This means that we can compute α_R in the limit (since we know the limiting distribution of the λ_i(S_n)) and get (since p = n we have γ = 1, γ_- = 0, and γ_+ = 4)

    lim_{n→∞} α_R(n) = ∫_0^4 λ^{1/2} dF_1(λ) = (1/(2π)) ∫_0^4 λ^{1/2} ( sqrt( λ(4 − λ) ) / λ ) dλ = 8/(3π) ≈ 0.8488.

Also, α_R(1) simply corresponds to the expected value of the absolute value of a standard gaussian g,

    α_R(1) = E|g| = sqrt(2/π) ≈ 0.7979,

which is compatible with the conjecture.

On the complex side, the Marchenko-Pastur distribution also holds, so lim_{n→∞} α_C(n) = lim_{n→∞} α_R(n); α_C(1) can also be easily calculated and seen to be larger than the limit.

2 Spike Models and the BBP transition

What if there actually is some (linear) low dimensional structure in the data? When can we expect to capture it with PCA? A particularly simple, yet relevant, example to analyse is when the covariance matrix Σ is an identity with a rank 1 perturbation, which we refer to as a spike model,

    Σ = I + β v v^T,

for v a unit norm vector and β ≥ 0. One way to think about this instance is that each data point x consists of a signal part sqrt(β) g_0 v, where g_0 is a one-dimensional standard gaussian (so sqrt(β) g_0 v is a gaussian multiple of the fixed vector v), and a noise part g ∼ N(0, I) (independent of g_0). Then x = g + sqrt(β) g_0 v is a gaussian random variable, x ∼ N(0, I + β v v^T).

A natural question is whether this rank 1 perturbation can be seen in S_n. Let us build some intuition with an example. The following is the histogram of the eigenvalues of a sample of S_n for p = 500, n = 1000, v the first element of the canonical basis (v = e_1), and β = 1.5:

    [Figure: histogram of the eigenvalues of a sample of S_n for the spike model with p = 500, n = 1000, β = 1.5.]

The image suggests that there is an eigenvalue of S_n that pops out of the support of the Marchenko-Pastur distribution (below we will estimate the location of this eigenvalue, and that estimate corresponds to the red "x" in the figure). It is worth noticing that the largest eigenvalue of Σ is simply 1 + β = 2.5, while the largest eigenvalue of S_n appears considerably larger than that. Let us now try the same experiment for β = 0.5:

    [Figure: histogram of the eigenvalues of a sample of S_n for the spike model with p = 500, n = 1000, β = 0.5.]

and it appears that, for β = 0.5, the distribution of the eigenvalues is indistinguishable from the one for Σ = I.

This motivates the following question:

Question 2.1 For which values of γ and β do we expect to see an eigenvalue of S_n popping out of the support of the Marchenko-Pastur distribution, and what is the limit value that we expect it to take?
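Question 2.1 can also be probed empirically before any theory. The sketch below is an illustrative simulation (seeded RNG, small sizes p = 50 and n = 100, β = 3, and power iteration standing in for an eigensolver): it builds X for the spike model with v = e_1 and checks that the leading eigenvalue of S_n escapes the Marchenko-Pastur edge γ_+ = (1 + sqrt(γ))^2 and that the leading eigenvector correlates with e_1:

```python
import math, random

random.seed(1)
p, n, beta = 50, 100, 3.0
gamma = p / n
gamma_plus = (1 + math.sqrt(gamma)) ** 2          # edge of the MP support

# X has i.i.d. N(0,1) entries, with the first row scaled by sqrt(1+beta),
# so each column is a draw from N(0, I + beta e1 e1^T).
X = [[(math.sqrt(1 + beta) if i == 0 else 1.0) * random.gauss(0, 1)
      for _ in range(n)] for i in range(p)]

def Sn_matvec(v):
    # S_n v = (1/n) X (X^T v), computed without forming the p x p matrix S_n.
    Xtv = [sum(X[i][j] * v[i] for i in range(p)) for j in range(n)]
    return [sum(X[i][j] * Xtv[j] for j in range(n)) / n for i in range(p)]

v = [1.0 if i == 0 else 0.0 for i in range(p)]
for _ in range(300):                               # power iteration on S_n
    w = Sn_matvec(v)
    norm = math.sqrt(sum(t * t for t in w))
    v = [t / norm for t in w]

w = Sn_matvec(v)
lam = sum(v[i] * w[i] for i in range(p))           # leading eigenvalue of S_n
corr2 = v[0] ** 2                                  # squared overlap with e_1
print(lam, gamma_plus, corr2)                      # lam well above gamma_plus
```

At these small sizes the numbers fluctuate from seed to seed, but the separation from the bulk edge is already unmistakable for this β.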

As we will see below, there is a critical value of β below which we don't expect to see a change in the distribution of eigenvalues, and above which we expect one of the eigenvalues to pop out of the support. This is known as the BBP transition (after Baik, Ben Arous, and Péché [BBAP05]). There are many very nice papers about this phenomenon, including [Pau, Joh01, BBAP05, Pau07, BS05, Kar05].⁵

In what follows we will find the critical value of β and estimate the location of the largest eigenvalue of S_n. While the argument we will use can be made precise (and is borrowed from [Pau]), we will be ignoring a few details for the sake of exposition. In short, the argument below can be transformed into a rigorous proof, but it is not one in its present form!

First of all, it is not difficult to see that we can assume that v = e_1 (since everything else is rotation invariant). We want to understand the behavior of the leading eigenvalue of

    S_n = (1/n) Σ_{i=1}^n x_i x_i^T = (1/n) X X^T,

where

    X = [x_1, ..., x_n] ∈ R^{p×n}.

We can write X as

    X = [ sqrt(1+β) Z_1 ; Z_2 ],

where Z_1 ∈ R^{1×n} and Z_2 ∈ R^{(p−1)×n}, both populated with i.i.d. standard gaussian entries N(0,1). Then,

    S_n = (1/n) X X^T = (1/n) [ (1+β) Z_1 Z_1^T ,  sqrt(1+β) Z_1 Z_2^T ;  sqrt(1+β) Z_2 Z_1^T ,  Z_2 Z_2^T ].

Now, let λ̂ and v = [v_1 ; v_2], with v_1 ∈ R and v_2 ∈ R^{p−1}, denote, respectively, an eigenvalue and an associated eigenvector of S_n. By the definition of eigenvalue and eigenvector we have

    (1/n) [ (1+β) Z_1 Z_1^T , sqrt(1+β) Z_1 Z_2^T ; sqrt(1+β) Z_2 Z_1^T , Z_2 Z_2^T ] [v_1 ; v_2] = λ̂ [v_1 ; v_2],

which can be rewritten as

    (1/n) (1+β) Z_1 Z_1^T v_1 + (1/n) sqrt(1+β) Z_1 Z_2^T v_2 = λ̂ v_1,    (16)
    (1/n) sqrt(1+β) Z_2 Z_1^T v_1 + (1/n) Z_2 Z_2^T v_2 = λ̂ v_2.    (17)

Equation (17) is equivalent to

    (1/n) sqrt(1+β) Z_2 Z_1^T v_1 = ( λ̂ I − (1/n) Z_2 Z_2^T ) v_2.

⁵ Notice that the Marchenko-Pastur theorem does not imply that all eigenvalues are actually in the support of the Marchenko-Pastur distribution; it just rules out that a non-vanishing proportion of them are outside it. However, it is possible to show that indeed, in the limit, all eigenvalues will be in the support (see, for example, [Pau]).

If λ̂ I − (1/n) Z_2 Z_2^T is invertible (this won't be justified here, but it is in [Pau]) then we can rewrite this as

    v_2 = ( λ̂ I − (1/n) Z_2 Z_2^T )^{-1} (1/n) sqrt(1+β) Z_2 Z_1^T v_1,

which we can then plug into (16) to get

    (1/n) (1+β) Z_1 Z_1^T v_1 + (1/n) sqrt(1+β) Z_1 Z_2^T ( λ̂ I − (1/n) Z_2 Z_2^T )^{-1} (1/n) sqrt(1+β) Z_2 Z_1^T v_1 = λ̂ v_1.

If v_1 ≠ 0 (again, not properly justified here, see [Pau]) then this means that

    λ̂ = (1/n) (1+β) Z_1 Z_1^T + (1/n) sqrt(1+β) Z_1 Z_2^T ( λ̂ I − (1/n) Z_2 Z_2^T )^{-1} (1/n) sqrt(1+β) Z_2 Z_1^T.    (18)

The first observation is that, because Z_1 ∈ R^{1×n} has standard gaussian entries, (1/n) Z_1 Z_1^T → 1, meaning that

    λ̂ ≈ (1+β) [ 1 + (1/n) Z_1 Z_2^T ( λ̂ I − (1/n) Z_2 Z_2^T )^{-1} (1/n) Z_2 Z_1^T ].    (19)

Consider the SVD Z_2^T = U Σ V^T, where U ∈ R^{n×(p−1)} has orthonormal columns (U^T U = I_{(p−1)×(p−1)}), V ∈ O(p−1), and Σ ∈ R^{(p−1)×(p−1)} is diagonal. Take D = (1/n) Σ^2; then

    (1/n) Z_2 Z_2^T = (1/n) V Σ^2 V^T = V D V^T,

meaning that the diagonal entries of D correspond to the eigenvalues of (1/n) Z_2 Z_2^T, which we expect to be distributed (in the limit) according to the Marchenko-Pastur distribution for (p−1)/n → γ. Replacing back in (19),

    λ̂ = (1+β) [ 1 + ( (1/√n) Z_1 U D^{1/2} V^T ) ( λ̂ I − V D V^T )^{-1} ( V D^{1/2} U^T Z_1^T (1/√n) ) ]
       = (1+β) [ 1 + (1/n) ( U^T Z_1^T )^T D^{1/2} ( V^T ( λ̂ I − V D V^T )^{-1} V ) D^{1/2} ( U^T Z_1^T ) ]
       = (1+β) [ 1 + (1/n) ( U^T Z_1^T )^T D^{1/2} ( λ̂ I − D )^{-1} D^{1/2} ( U^T Z_1^T ) ].

Since the columns of U are orthonormal, g := U^T Z_1^T ∈ R^{p−1} is an isotropic gaussian, g ∼ N(0, I); in fact,

    E[ g g^T ] = E[ U^T Z_1^T Z_1 U ] = U^T E[ Z_1^T Z_1 ] U = U^T U = I_{(p−1)×(p−1)}.

We proceed:

    λ̂ = (1+β) [ 1 + (1/n) g^T D^{1/2} ( λ̂ I − D )^{-1} D^{1/2} g ] = (1+β) [ 1 + (1/n) Σ_{j=1}^{p−1} g_j^2 D_jj / ( λ̂ − D_jj ) ].

Because we expect the diagonal entries of D to be distributed according to the Marchenko-Pastur distribution, and g to be independent of them, we expect that (again, not properly justified here, see [Pau])

    (1/n) Σ_{j=1}^{p−1} g_j^2 D_jj / ( λ̂ − D_jj ) = ((p−1)/n) · (1/(p−1)) Σ_{j=1}^{p−1} g_j^2 D_jj / ( λ̂ − D_jj ) → γ ∫_{γ_-}^{γ_+} ( x / (λ̂ − x) ) dF_γ(x).

We thus get an equation for λ̂:

    λ̂ = (1+β) [ 1 + γ ∫_{γ_-}^{γ_+} ( x / (λ̂ − x) ) dF_γ(x) ],

which can be easily solved with the help of a program that computes integrals symbolically (such as Mathematica) to give (you can also see [Pau] for a derivation)

    λ̂ = (1+β) ( 1 + γ/β ),    (20)

which is particularly elegant (especially considering the size of some of the equations used in the derivation).

An important thing to notice is that for β = sqrt(γ) we have

    λ̂ = (1 + sqrt(γ)) ( 1 + γ/sqrt(γ) ) = (1 + sqrt(γ))^2 = γ_+,

suggesting that β = sqrt(γ) is the critical point. Indeed this is the case, and it is possible to make the above argument rigorous⁶ and show that, in the model described above,

    if β ≤ sqrt(γ) then λ_max(S_n) → γ_+,

and if β > sqrt(γ) then

    λ_max(S_n) → (1+β) ( 1 + γ/β ) > γ_+.

Another important question is whether the leading eigenvector actually correlates with the planted perturbation (in this case e_1). It turns out that very similar techniques can answer this question as well [Pau], and show that the leading eigenvector v_max of S_n will be non-trivially correlated with e_1 if and only if β > sqrt(γ). More precisely:

    if β ≤ sqrt(γ) then |⟨v_max, e_1⟩|^2 → 0,

and if β > sqrt(γ) then

    |⟨v_max, e_1⟩|^2 → ( 1 − γ/β^2 ) / ( 1 + γ/β ).

⁶ Note that in the argument above it wasn't even completely clear where it was used that the eigenvalue in question was the leading one. In the actual proof one first needs to make sure that there is an eigenvalue outside of the support, and the argument then only holds for that one; see [Pau].
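Both the Marchenko-Pastur density (15) and the claim that (20) solves the fixed-point equation can be verified numerically without any symbolic software. The sketch below uses a crude midpoint-rule integrator (the step count and the choice γ = 0.5, β = 1.5 are illustrative assumptions):

```python
import math

def mp_density(x, g):
    # Marchenko-Pastur density (15) with parameter gamma = g.
    gp, gm = (1 + math.sqrt(g)) ** 2, (1 - math.sqrt(g)) ** 2
    if x <= gm or x >= gp:
        return 0.0
    return math.sqrt((gp - x) * (x - gm)) / (2 * math.pi * g * x)

def mp_integral(f, g, steps=100000):
    # Midpoint rule for the integral of f(x) dF_gamma(x) over [gamma_-, gamma_+].
    gp, gm = (1 + math.sqrt(g)) ** 2, (1 - math.sqrt(g)) ** 2
    h = (gp - gm) / steps
    return sum(f(gm + (k + 0.5) * h) * mp_density(gm + (k + 0.5) * h, g) * h
               for k in range(steps))

gamma, beta = 0.5, 1.5                                 # beta > sqrt(gamma): supercritical
mass = mp_integral(lambda x: 1.0, gamma)               # total mass: should be 1
lam_hat = (1 + beta) * (1 + gamma / beta)              # formula (20)
rhs = (1 + beta) * (1 + gamma * mp_integral(lambda x: x / (lam_hat - x), gamma))
crit = (1 + math.sqrt(gamma)) * (1 + gamma / math.sqrt(gamma))   # (20) at beta = sqrt(gamma)
print(mass, lam_hat, rhs, crit)   # mass ≈ 1, lam_hat ≈ rhs, crit = gamma_plus
```

Note that the integrand x/(λ̂ − x) is only well behaved because λ̂ lies strictly above γ_+ in the supercritical regime; for β below the threshold, the candidate λ̂ would fall inside the support and the equation degenerates.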

A brief mention of Wigner matrices

Another very important random matrix model is the Wigner matrix (it will show up later in this course). Given an integer n, a standard gaussian Wigner matrix W ∈ R^{n×n} is a symmetric matrix with independent N(0,1) entries (except for the symmetry constraint W_ij = W_ji). In the limit, the eigenvalues of (1/√n) W are distributed according to the so-called semicircle law

    dSC(x) = (1/(2π)) sqrt(4 − x^2) 1_{[−2,2]}(x) dx,

and there is also a BBP-like transition for this matrix ensemble [FP06]. More precisely, if v is a unit-norm vector in R^n and ξ ≥ 0, then the largest eigenvalue of (1/√n) W + ξ v v^T satisfies:

    if ξ ≤ 1 then λ_max( (1/√n) W + ξ v v^T ) → 2,

and if ξ > 1 then

    λ_max( (1/√n) W + ξ v v^T ) → ξ + 1/ξ.    (21)

An open problem about spike models

Open Problem 2.1 (Spike model for the cut SDP [MS15]) Let W denote a symmetric Wigner matrix with i.i.d. entries W_ij ∼ N(0,1). Also, given B ∈ R^{n×n} symmetric, define

    Q(B) = max { Tr(BX) : X ⪰ 0, X_ii = 1 },

and define q(ξ) as

    q(ξ) = lim_{n→∞} (1/n) E Q( (ξ/n) 1 1^T + (1/√n) W ).

What is the value of ξ*, defined as

    ξ* = inf { ξ ≥ 0 : q(ξ) > 2 } ?

It is known that q(ξ) = 2 for 0 ≤ ξ ≤ 1 [MS15]. One can show that (1/n) Q(B) ≤ λ_max(B). In fact,

    max { Tr(BX) : X ⪰ 0, X_ii = 1 } ≤ max { Tr(BX) : X ⪰ 0, Tr(X) = n },

and it is also not difficult to show (hint: take the spectral decomposition of X) that

    max { Tr(BX) : X ⪰ 0, Tr(X) = n } = n λ_max(B).

This means that for ξ > 1, q(ξ) ≤ ξ + 1/ξ.
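Returning to the Wigner model above, the normalization and variance of the semicircle law, and the behavior of the limit ξ + 1/ξ from (21), can be sanity-checked numerically (midpoint rule again; the step count and the sample values of ξ are arbitrary illustrative choices):

```python
import math

def sc_density(x):
    # Semicircle density (1/2pi) sqrt(4 - x^2) on [-2, 2].
    return math.sqrt(max(4.0 - x * x, 0.0)) / (2 * math.pi)

steps = 100000
h = 4.0 / steps
mass = second = 0.0
for k in range(steps):
    x = -2.0 + (k + 0.5) * h     # midpoint of the k-th cell of [-2, 2]
    w = sc_density(x) * h
    mass += w                    # total mass: should be 1
    second += x * x * w          # second moment: should be 1

# xi + 1/xi equals the bulk edge 2 exactly at xi = 1 and exceeds it for xi > 1,
# which is when the spiked eigenvalue separates from the semicircle bulk.
vals = [(xi, xi + 1 / xi) for xi in [0.5, 1.0, 1.5, 3.0]]
print(mass, second, vals)
```

Since ξ + 1/ξ ≥ 2 for all ξ > 0, the limiting top eigenvalue never dips below the bulk edge; it merely sticks to it throughout the subcritical regime ξ ≤ 1.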

Remark 2.1 Optimization problems of the type max { Tr(BX) : X ⪰ 0, X_ii = 1 } are semidefinite programs; they will be a major player later in the course!

Since (1/n) E Tr[ ( (ξ/n) 1 1^T + (1/√n) W ) 1 1^T ] = ξ, by taking X = 1 1^T we expect that q(ξ) ≥ ξ. These observations imply that 1 ≤ ξ* < 2 (see [MS15]). A reasonable conjecture is that it is equal to 1. This would imply that a certain semidefinite programming based algorithm for clustering under the Stochastic Block Model on 2 clusters (we will discuss these things later in the course) is optimal for detection (see [MS15]).⁷

⁷ Later in the course we will discuss clustering under the Stochastic Block Model quite thoroughly, and will see how this same SDP is known to be optimal for exact recovery [ABH14, HWX14, Ban15a].

References

[ABH14] E. Abbe, A. S. Bandeira, and G. Hall. Exact recovery in the stochastic block model. Available online at arXiv [cs.SI], 2014.

[AGZ10] G. W. Anderson, A. Guionnet, and O. Zeitouni. An Introduction to Random Matrices. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, 2010.

[Bai99] Z. D. Bai. Methodologies in spectral analysis of large dimensional random matrices, a review. Statistica Sinica, 9, 1999.

[Ban15a] A. S. Bandeira. Random Laplacian matrices and convex relaxations. Available online at arXiv [math.PR], 2015.

[Ban15b] A. S. Bandeira. Relax and Conquer BLOG: 18.S096 Principal Component Analysis in High Dimensions and the Spike Model, 2015.

[BBAP05] J. Baik, G. Ben Arous, and S. Péché. Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. The Annals of Probability, 33(5), 2005.

[BKS13] A. S. Bandeira, C. Kennedy, and A. Singer. Approximating the little Grothendieck problem over the orthogonal group. Available online at arXiv [cs.DS], 2013.

[BS05] J. Baik and J. W. Silverstein. Eigenvalues of large sample covariance matrices of spiked population models, 2005.

[FP06] D. Féral and S. Péché. The largest eigenvalue of rank one deformation of large Wigner matrices. Communications in Mathematical Physics, 272(1), 2007.

[Gol96] G. H. Golub. Matrix Computations. Johns Hopkins University Press, third edition, 1996.

[HJ85] R. A. Horn and C. R. Johnson. Matrix Analysis. Cambridge University Press, 1985.

[HM09] N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. Available online at arXiv [math.NA], 2009.

[HWX14] B. Hajek, Y. Wu, and J. Xu. Achieving exact cluster recovery threshold via semidefinite programming. Available online at arXiv, 2014.

[Joh01] I. M. Johnstone. On the distribution of the largest eigenvalue in principal components analysis. The Annals of Statistics, 29(2), 2001.

[Kar05] N. El Karoui. Recent results about the largest eigenvalue of random covariance matrices and statistical application. Acta Physica Polonica B, 36(9), 2005.

[MM15] C. Musco and C. Musco. Stronger and faster approximate singular value decomposition via the block Lanczos method. Available at arXiv [cs.DS], 2015.

[Mos11] M. S. Moslehian. Ky Fan inequalities. Available online at arXiv [math.FA], 2011.

[MP67] V. A. Marchenko and L. A. Pastur. Distribution of eigenvalues in certain sets of random matrices. Mat. Sb. (N.S.), 72(114), 1967.

[MS15] A. Montanari and S. Sen. Semidefinite programs on sparse random graphs. Available online at arXiv [cs.DM], 2015.

[MZ11] S. Mallat and O. Zeitouni. A conjecture concerning optimality of the Karhunen-Loève basis in nonlinear reconstruction. Available online at arXiv [math.PR], 2011.

[Pau] D. Paul. Asymptotics of the leading sample eigenvalues for a spiked covariance model. Available online at http://anson.ucdavis.edu/~debashis/techrep/eigenlimit.pdf.

[Pau07] D. Paul. Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, 17, 2007.

[Pea01] K. Pearson. On lines and planes of closest fit to systems of points in space. Philosophical Magazine, Series 6, 2(11), 1901.

[RST09] V. Rokhlin, A. Szlam, and M. Tygert. A randomized algorithm for principal component analysis. Available at arXiv [stat.CO], 2009.

[Tao12] T. Tao. Topics in Random Matrix Theory. Graduate Studies in Mathematics. American Mathematical Soc., 2012.


More information

Topics in Eigen-analysis

Topics in Eigen-analysis Topics i Eige-aalysis Li Zajiag 28 July 2014 Cotets 1 Termiology... 2 2 Some Basic Properties ad Results... 2 3 Eige-properties of Hermitia Matrices... 5 3.1 Basic Theorems... 5 3.2 Quadratic Forms & Noegative

More information

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j.

Apply change-of-basis formula to rewrite x as a linear combination of eigenvectors v j. Eigevalue-Eigevector Istructor: Nam Su Wag eigemcd Ay vector i real Euclidea space of dimesio ca be uiquely epressed as a liear combiatio of liearly idepedet vectors (ie, basis) g j, j,,, α g α g α g α

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

Lecture 2: April 3, 2013

Lecture 2: April 3, 2013 TTIC/CMSC 350 Mathematical Toolkit Sprig 203 Madhur Tulsiai Lecture 2: April 3, 203 Scribe: Shubhedu Trivedi Coi tosses cotiued We retur to the coi tossig example from the last lecture agai: Example. Give,

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

1 Last time: similar and diagonalizable matrices

1 Last time: similar and diagonalizable matrices Last time: similar ad diagoalizable matrices Let be a positive iteger Suppose A is a matrix, v R, ad λ R Recall that v a eigevector for A with eigevalue λ if v ad Av λv, or equivaletly if v is a ozero

More information

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22

Discrete Mathematics for CS Spring 2007 Luca Trevisan Lecture 22 CS 70 Discrete Mathematics for CS Sprig 2007 Luca Trevisa Lecture 22 Aother Importat Distributio The Geometric Distributio Questio: A biased coi with Heads probability p is tossed repeatedly util the first

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 12 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract I this lecture we derive risk bouds for kerel methods. We will start by showig that Soft Margi kerel SVM correspods to miimizig

More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen)

Goodness-of-Fit Tests and Categorical Data Analysis (Devore Chapter Fourteen) Goodess-of-Fit Tests ad Categorical Data Aalysis (Devore Chapter Fourtee) MATH-252-01: Probability ad Statistics II Sprig 2019 Cotets 1 Chi-Squared Tests with Kow Probabilities 1 1.1 Chi-Squared Testig................

More information

Statistics 511 Additional Materials

Statistics 511 Additional Materials Cofidece Itervals o mu Statistics 511 Additioal Materials This topic officially moves us from probability to statistics. We begi to discuss makig ifereces about the populatio. Oe way to differetiate probability

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

Lecture 12: September 27

Lecture 12: September 27 36-705: Itermediate Statistics Fall 207 Lecturer: Siva Balakrisha Lecture 2: September 27 Today we will discuss sufficiecy i more detail ad the begi to discuss some geeral strategies for costructig estimators.

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

Lecture 2: Monte Carlo Simulation

Lecture 2: Monte Carlo Simulation STAT/Q SCI 43: Itroductio to Resamplig ethods Sprig 27 Istructor: Ye-Chi Che Lecture 2: ote Carlo Simulatio 2 ote Carlo Itegratio Assume we wat to evaluate the followig itegratio: e x3 dx What ca we do?

More information

LINEAR ALGEBRA. Paul Dawkins

LINEAR ALGEBRA. Paul Dawkins LINEAR ALGEBRA Paul Dawkis Table of Cotets Preface... ii Outlie... iii Systems of Equatios ad Matrices... Itroductio... Systems of Equatios... Solvig Systems of Equatios... 5 Matrices... 7 Matrix Arithmetic

More information

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices

A Hadamard-type lower bound for symmetric diagonally dominant positive matrices A Hadamard-type lower boud for symmetric diagoally domiat positive matrices Christopher J. Hillar, Adre Wibisoo Uiversity of Califoria, Berkeley Jauary 7, 205 Abstract We prove a ew lower-boud form of

More information

Factor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis

Factor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis Lecture 10: Factor Aalysis ad Pricipal Compoet Aalysis Sam Roweis February 9, 2004 Whe we assume that the subspace is liear ad that the uderlyig latet variable has a Gaussia distributio we get a model

More information

Lecture 19. sup y 1,..., yn B d n

Lecture 19. sup y 1,..., yn B d n STAT 06A: Polyomials of adom Variables Lecture date: Nov Lecture 19 Grothedieck s Iequality Scribe: Be Hough The scribes are based o a guest lecture by ya O Doell. I this lecture we prove Grothedieck s

More information

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n. Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator

More information

The Sample Variance Formula: A Detailed Study of an Old Controversy

The Sample Variance Formula: A Detailed Study of an Old Controversy The Sample Variace Formula: A Detailed Study of a Old Cotroversy Ky M. Vu PhD. AuLac Techologies Ic. c 00 Email: kymvu@aulactechologies.com Abstract The two biased ad ubiased formulae for the sample variace

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Section 4.3. Boolean functions

Section 4.3. Boolean functions Sectio 4.3. Boolea fuctios Let us take aother look at the simplest o-trivial Boolea algebra, ({0}), the power-set algebra based o a oe-elemet set, chose here as {0}. This has two elemets, the empty set,

More information

Chandrasekhar Type Algorithms. for the Riccati Equation of Lainiotis Filter

Chandrasekhar Type Algorithms. for the Riccati Equation of Lainiotis Filter Cotemporary Egieerig Scieces, Vol. 3, 00, o. 4, 9-00 Chadrasekhar ype Algorithms for the Riccati Equatio of Laiiotis Filter Nicholas Assimakis Departmet of Electroics echological Educatioal Istitute of

More information

The random version of Dvoretzky s theorem in l n

The random version of Dvoretzky s theorem in l n The radom versio of Dvoretzky s theorem i l Gideo Schechtma Abstract We show that with high probability a sectio of the l ball of dimesio k cε log c > 0 a uiversal costat) is ε close to a multiple of the

More information

IP Reference guide for integer programming formulations.

IP Reference guide for integer programming formulations. IP Referece guide for iteger programmig formulatios. by James B. Orli for 15.053 ad 15.058 This documet is iteded as a compact (or relatively compact) guide to the formulatio of iteger programs. For more

More information

Lecture 3: August 31

Lecture 3: August 31 36-705: Itermediate Statistics Fall 018 Lecturer: Siva Balakrisha Lecture 3: August 31 This lecture will be mostly a summary of other useful expoetial tail bouds We will ot prove ay of these i lecture,

More information

arxiv: v1 [math.pr] 13 Oct 2011

arxiv: v1 [math.pr] 13 Oct 2011 A tail iequality for quadratic forms of subgaussia radom vectors Daiel Hsu, Sham M. Kakade,, ad Tog Zhag 3 arxiv:0.84v math.pr] 3 Oct 0 Microsoft Research New Eglad Departmet of Statistics, Wharto School,

More information

THE SPECTRAL RADII AND NORMS OF LARGE DIMENSIONAL NON-CENTRAL RANDOM MATRICES

THE SPECTRAL RADII AND NORMS OF LARGE DIMENSIONAL NON-CENTRAL RANDOM MATRICES COMMUN. STATIST.-STOCHASTIC MODELS, 0(3), 525-532 (994) THE SPECTRAL RADII AND NORMS OF LARGE DIMENSIONAL NON-CENTRAL RANDOM MATRICES Jack W. Silverstei Departmet of Mathematics, Box 8205 North Carolia

More information

c 2006 Society for Industrial and Applied Mathematics

c 2006 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 7, No. 3, pp. 851 860 c 006 Society for Idustrial ad Applied Mathematics EXTREMAL EIGENVALUES OF REAL SYMMETRIC MATRICES WITH ENTRIES IN AN INTERVAL XINGZHI ZHAN Abstract.

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

Chapter 12 EM algorithms The Expectation-Maximization (EM) algorithm is a maximum likelihood method for models that have hidden variables eg. Gaussian

Chapter 12 EM algorithms The Expectation-Maximization (EM) algorithm is a maximum likelihood method for models that have hidden variables eg. Gaussian Chapter 2 EM algorithms The Expectatio-Maximizatio (EM) algorithm is a maximum likelihood method for models that have hidde variables eg. Gaussia Mixture Models (GMMs), Liear Dyamic Systems (LDSs) ad Hidde

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES.

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ANDREW SALCH 1. The Jacobia criterio for osigularity. You have probably oticed by ow that some poits o varieties are smooth i a sese somethig

More information

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator

Slide Set 13 Linear Model with Endogenous Regressors and the GMM estimator Slide Set 13 Liear Model with Edogeous Regressors ad the GMM estimator Pietro Coretto pcoretto@uisa.it Ecoometrics Master i Ecoomics ad Fiace (MEF) Uiversità degli Studi di Napoli Federico II Versio: Friday

More information

Fastest mixing Markov chain on a path

Fastest mixing Markov chain on a path Fastest mixig Markov chai o a path Stephe Boyd Persi Diacois Ju Su Li Xiao Revised July 2004 Abstract We ider the problem of assigig trasitio probabilities to the edges of a path, so the resultig Markov

More information

Last time: Moments of the Poisson distribution from its generating function. Example: Using telescope to measure intensity of an object

Last time: Moments of the Poisson distribution from its generating function. Example: Using telescope to measure intensity of an object 6.3 Stochastic Estimatio ad Cotrol, Fall 004 Lecture 7 Last time: Momets of the Poisso distributio from its geeratig fuctio. Gs () e dg µ e ds dg µ ( s) µ ( s) µ ( s) µ e ds dg X µ ds X s dg dg + ds ds

More information

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001.

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001. Physics 324, Fall 2002 Dirac Notatio These otes were produced by David Kapla for Phys. 324 i Autum 2001. 1 Vectors 1.1 Ier product Recall from liear algebra: we ca represet a vector V as a colum vector;

More information

Math Solutions to homework 6

Math Solutions to homework 6 Math 175 - Solutios to homework 6 Cédric De Groote November 16, 2017 Problem 1 (8.11 i the book): Let K be a compact Hermitia operator o a Hilbert space H ad let the kerel of K be {0}. Show that there

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Symmetric Matrices and Quadratic Forms

Symmetric Matrices and Quadratic Forms 7 Symmetric Matrices ad Quadratic Forms 7.1 DIAGONALIZAION OF SYMMERIC MARICES SYMMERIC MARIX A symmetric matrix is a matrix A such that. A = A Such a matrix is ecessarily square. Its mai diagoal etries

More information

Achieving Stationary Distributions in Markov Chains. Monday, November 17, 2008 Rice University

Achieving Stationary Distributions in Markov Chains. Monday, November 17, 2008 Rice University Istructor: Achievig Statioary Distributios i Markov Chais Moday, November 1, 008 Rice Uiversity Dr. Volka Cevher STAT 1 / ELEC 9: Graphical Models Scribe: Rya E. Guerra, Tahira N. Saleem, Terrace D. Savitsky

More information

THE KALMAN FILTER RAUL ROJAS

THE KALMAN FILTER RAUL ROJAS THE KALMAN FILTER RAUL ROJAS Abstract. This paper provides a getle itroductio to the Kalma filter, a umerical method that ca be used for sesor fusio or for calculatio of trajectories. First, we cosider

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR. ANTHONY BROWN 8. Statistics 8.1. Measures of Cetre: Mea, Media ad Mode. If we have a series of umbers the

More information

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS DEMETRES CHRISTOFIDES Abstract. Cosider a ivertible matrix over some field. The Gauss-Jorda elimiatio reduces this matrix to the idetity

More information

MATH10212 Linear Algebra B Proof Problems

MATH10212 Linear Algebra B Proof Problems MATH22 Liear Algebra Proof Problems 5 Jue 26 Each problem requests a proof of a simple statemet Problems placed lower i the list may use the results of previous oes Matrices ermiats If a b R the matrix

More information

The Random Walk For Dummies

The Random Walk For Dummies The Radom Walk For Dummies Richard A Mote Abstract We look at the priciples goverig the oe-dimesioal discrete radom walk First we review five basic cocepts of probability theory The we cosider the Beroulli

More information

Problem Set 2 Solutions

Problem Set 2 Solutions CS271 Radomess & Computatio, Sprig 2018 Problem Set 2 Solutios Poit totals are i the margi; the maximum total umber of poits was 52. 1. Probabilistic method for domiatig sets 6pts Pick a radom subset S

More information

Chi-squared tests Math 6070, Spring 2014

Chi-squared tests Math 6070, Spring 2014 Chi-squared tests Math 6070, Sprig 204 Davar Khoshevisa Uiversity of Utah March, 204 Cotets MLE for goodess-of fit 2 2 The Multivariate ormal distributio 3 3 Cetral limit theorems 5 4 Applicatio to goodess-of-fit

More information

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK)

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) Everythig marked by is ot required by the course syllabus I this lecture, all vector spaces is over the real umber R. All vectors i R is viewed as a colum

More information

ECON 3150/4150, Spring term Lecture 3

ECON 3150/4150, Spring term Lecture 3 Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio

More information

(VII.A) Review of Orthogonality

(VII.A) Review of Orthogonality VII.A Review of Orthogoality At the begiig of our study of liear trasformatios i we briefly discussed projectios, rotatios ad projectios. I III.A, projectios were treated i the abstract ad without regard

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

Homework Set #3 - Solutions

Homework Set #3 - Solutions EE 15 - Applicatios of Covex Optimizatio i Sigal Processig ad Commuicatios Dr. Adre Tkaceko JPL Third Term 11-1 Homework Set #3 - Solutios 1. a) Note that x is closer to x tha to x l i the Euclidea orm

More information

CHAPTER I: Vector Spaces

CHAPTER I: Vector Spaces CHAPTER I: Vector Spaces Sectio 1: Itroductio ad Examples This first chapter is largely a review of topics you probably saw i your liear algebra course. So why cover it? (1) Not everyoe remembers everythig

More information

State Space Representation

State Space Representation Optimal Cotrol, Guidace ad Estimatio Lecture 2 Overview of SS Approach ad Matrix heory Prof. Radhakat Padhi Dept. of Aerospace Egieerig Idia Istitute of Sciece - Bagalore State Space Represetatio Prof.

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 11 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture Tolstikhi Ilya Abstract We will itroduce the otio of reproducig kerels ad associated Reproducig Kerel Hilbert Spaces (RKHS). We will cosider couple

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Matrix Theory, Math6304 Lecture Notes from October 25, 2012 taken by Manisha Bhardwaj

Matrix Theory, Math6304 Lecture Notes from October 25, 2012 taken by Manisha Bhardwaj Matrix Theory, Math6304 Lecture Notes from October 25, 2012 take by Maisha Bhardwaj Last Time (10/23/12) Example for low-rak perturbatio, re-examied Relatig eigevalues of matrices ad pricipal submatrices

More information

Statistical and Mathematical Methods DS-GA 1002 December 8, Sample Final Problems Solutions

Statistical and Mathematical Methods DS-GA 1002 December 8, Sample Final Problems Solutions Statistical ad Mathematical Methods DS-GA 00 December 8, 05. Short questios Sample Fial Problems Solutios a. Ax b has a solutio if b is i the rage of A. The dimesio of the rage of A is because A has liearly-idepedet

More information

Lecture 9: September 19

Lecture 9: September 19 36-700: Probability ad Mathematical Statistics I Fall 206 Lecturer: Siva Balakrisha Lecture 9: September 9 9. Review ad Outlie Last class we discussed: Statistical estimatio broadly Pot estimatio Bias-Variace

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

if j is a neighbor of i,

if j is a neighbor of i, We see that if i = j the the coditio is trivially satisfied. Otherwise, T ij (i) = (i)q ij mi 1, (j)q ji, ad, (i)q ij T ji (j) = (j)q ji mi 1, (i)q ij. (j)q ji Now there are two cases, if (j)q ji (i)q

More information

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled 1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how

More information

Tridiagonal reduction redux

Tridiagonal reduction redux Week 15: Moday, Nov 27 Wedesday, Nov 29 Tridiagoal reductio redux Cosider the problem of computig eigevalues of a symmetric matrix A that is large ad sparse Usig the grail code i LAPACK, we ca compute

More information

Algebra of Least Squares

Algebra of Least Squares October 19, 2018 Algebra of Least Squares Geometry of Least Squares Recall that out data is like a table [Y X] where Y collects observatios o the depedet variable Y ad X collects observatios o the k-dimesioal

More information

Stat 421-SP2012 Interval Estimation Section

Stat 421-SP2012 Interval Estimation Section Stat 41-SP01 Iterval Estimatio Sectio 11.1-11. We ow uderstad (Chapter 10) how to fid poit estimators of a ukow parameter. o However, a poit estimate does ot provide ay iformatio about the ucertaity (possible

More information

The Method of Least Squares. To understand least squares fitting of data.

The Method of Least Squares. To understand least squares fitting of data. The Method of Least Squares KEY WORDS Curve fittig, least square GOAL To uderstad least squares fittig of data To uderstad the least squares solutio of icosistet systems of liear equatios 1 Motivatio Curve

More information

The Binomial Theorem

The Binomial Theorem The Biomial Theorem Robert Marti Itroductio The Biomial Theorem is used to expad biomials, that is, brackets cosistig of two distict terms The formula for the Biomial Theorem is as follows: (a + b ( k

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Notes for Lecture 5. 1 Grover Search. 1.1 The Setting. 1.2 Motivation. Lecture 5 (September 26, 2018)

Notes for Lecture 5. 1 Grover Search. 1.1 The Setting. 1.2 Motivation. Lecture 5 (September 26, 2018) COS 597A: Quatum Cryptography Lecture 5 (September 6, 08) Lecturer: Mark Zhadry Priceto Uiversity Scribe: Fermi Ma Notes for Lecture 5 Today we ll move o from the slightly cotrived applicatios of quatum

More information

Mon Apr Second derivative test, and maybe another conic diagonalization example. Announcements: Warm-up Exercise:

Mon Apr Second derivative test, and maybe another conic diagonalization example. Announcements: Warm-up Exercise: Math 2270-004 Week 15 otes We will ot ecessarily iish the material rom a give day's otes o that day We may also add or subtract some material as the week progresses, but these otes represet a i-depth outlie

More information