1 Principal Component Analysis in High Dimensions and the Spike Model

1.1 Dimension Reduction and PCA

When faced with a high dimensional dataset, a natural approach is to try to reduce its dimension, either by projecting it to a lower dimensional space or by finding a better representation for the data. During this course we will see a few different ways of doing dimension reduction. We will start with Principal Component Analysis (PCA). In fact, PCA continues to be one of the best (and simplest) tools for exploratory data analysis. Remarkably, it dates back to a 1901 paper by Karl Pearson [Pea01]!

Let's say we have n data points x_1, ..., x_n in R^p, for some p, and we are interested in (linearly) projecting the data to d < p dimensions. This is particularly useful if, say, one wants to visualize the data in two or three dimensions. There are a couple of different ways we can try to choose this projection:

1. Finding the d-dimensional affine subspace for which the projections of x_1, ..., x_n on it best approximate the original points x_1, ..., x_n.

2. Finding the d-dimensional projection of x_1, ..., x_n that preserves as much variance of the data as possible.

As we will see below, these two approaches are equivalent and they correspond to Principal Component Analysis.

Before proceeding, we recall a couple of simple statistical quantities associated with x_1, ..., x_n that will reappear below. Given x_1, ..., x_n, we define the sample mean as

    μ_n = (1/n) ∑_{k=1}^n x_k,    (4)

and the sample covariance as

    Σ_n = (1/(n−1)) ∑_{k=1}^n (x_k − μ_n)(x_k − μ_n)^T.    (5)

Remark 1.1 If x_1, ..., x_n are independently sampled from a distribution, then μ_n and Σ_n are unbiased estimators for, respectively, the mean and covariance of that distribution.

We will start with the first interpretation of PCA and then show that it is equivalent to the second.

1.1.1 PCA as best d-dimensional affine fit

We are trying to approximate each x_k by

    x_k ≈ μ + ∑_{i=1}^d (β_k)_i v_i,    (6)

where v_1, ..., v_d is an orthonormal basis for the d-dimensional subspace, μ ∈ R^p represents the translation, and β_k corresponds to the coefficients of x_k. If we represent the subspace by the matrix

    V = [v_1 ⋯ v_d] ∈ R^{p×d},

then we can rewrite (6) as

    x_k ≈ μ + V β_k,    (7)

where V^T V = I_{d×d}, as the vectors v_i are orthonormal.

We will measure goodness of fit in terms of least squares and attempt to solve

    min_{μ, V, β_k : V^T V = I} ∑_{k=1}^n || x_k − (μ + V β_k) ||_2^2.    (8)

We start by optimizing for μ. It is easy to see that the first-order conditions for μ correspond to

    ∇_μ ∑_{k=1}^n || x_k − (μ + V β_k) ||_2^2 = 0  ⇔  ∑_{k=1}^n ( x_k − (μ + V β_k) ) = 0.

Thus, the optimal value μ* of μ satisfies

    ∑_{k=1}^n x_k − n μ* − V ( ∑_{k=1}^n β_k ) = 0.

Because we may assume ∑_{k=1}^n β_k = 0 without loss of generality (otherwise we could simply absorb the term (1/n) V ∑_k β_k into μ), the optimal μ is given by

    μ* = (1/n) ∑_{k=1}^n x_k = μ_n,

the sample mean. We can then proceed to find the solution of (8) by solving

    min_{V, β_k : V^T V = I} ∑_{k=1}^n || x_k − μ_n − V β_k ||_2^2.    (9)

Let us proceed by optimizing for β_k. Since the problem decouples for each k, we can focus on, for each k,

    min_{β_k} || x_k − μ_n − V β_k ||_2^2 = min_{β_k} || x_k − μ_n − ∑_{i=1}^d (β_k)_i v_i ||_2^2.    (10)

Since v_1, ..., v_d are orthonormal, it is easy to see that the solution is given by (β_k)_i = v_i^T (x_k − μ_n), which can be succinctly written as β_k = V^T (x_k − μ_n). Thus, (9) is equivalent to

    min_{V : V^T V = I} ∑_{k=1}^n || (x_k − μ_n) − V V^T (x_k − μ_n) ||_2^2.    (11)
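The quantities derived so far — the sample mean, the sample covariance, and the least-squares coefficients β_k = V^T(x_k − μ_n) — are easy to check numerically. A minimal sketch, assuming numpy is available; the synthetic data, the sizes, and the particular orthonormal V are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, d = 200, 5, 2
X = rng.normal(size=(n, p))          # rows are the data points x_1, ..., x_n

# Sample mean and (unbiased) sample covariance.
mu_n = X.mean(axis=0)
Sigma_n = (X - mu_n).T @ (X - mu_n) / (n - 1)

# An arbitrary matrix V with orthonormal columns (QR factor of a random matrix).
V, _ = np.linalg.qr(rng.normal(size=(p, d)))

# Optimal coefficients beta_k = V^T (x_k - mu_n): the orthogonal projection.
B = (X - mu_n) @ V                   # row k holds beta_k

def fit_error(coeffs):
    """Total squared residual of the affine approximation x_k ~ mu_n + V beta_k."""
    return np.sum(((X - mu_n) - coeffs @ V.T) ** 2)

# Any perturbation of the coefficients can only increase the residual.
base = fit_error(B)
worse = fit_error(B + 0.1 * rng.normal(size=B.shape))
print(base < worse)
```

The last check is just the projection theorem: for fixed V, the residual decouples per point, and the projection coefficients are the unique minimizer.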

Note that

    || (x_k − μ_n) − V V^T (x_k − μ_n) ||_2^2
      = (x_k − μ_n)^T (x_k − μ_n) − 2 (x_k − μ_n)^T V V^T (x_k − μ_n) + (x_k − μ_n)^T V V^T V V^T (x_k − μ_n)
      = (x_k − μ_n)^T (x_k − μ_n) − (x_k − μ_n)^T V V^T (x_k − μ_n),

where the last step uses V^T V = I. Since (x_k − μ_n)^T (x_k − μ_n) does not depend on V, minimizing (11) is equivalent to

    max_{V : V^T V = I} ∑_{k=1}^n (x_k − μ_n)^T V V^T (x_k − μ_n).    (12)

A few more simple algebraic manipulations using properties of the trace:

    ∑_{k=1}^n (x_k − μ_n)^T V V^T (x_k − μ_n)
      = ∑_{k=1}^n Tr[ (x_k − μ_n)^T V V^T (x_k − μ_n) ]
      = ∑_{k=1}^n Tr[ V^T (x_k − μ_n)(x_k − μ_n)^T V ]
      = Tr[ V^T ( ∑_{k=1}^n (x_k − μ_n)(x_k − μ_n)^T ) V ]
      = (n − 1) Tr[ V^T Σ_n V ].

This means that the solution to (12) is given by the solution to

    max_{V : V^T V = I} Tr[ V^T Σ_n V ].    (13)

As we saw above (by the variational characterization of the eigenvalues of a symmetric matrix), the solution is given by V = [v_1, ..., v_d], where v_1, ..., v_d correspond to the d leading eigenvectors of Σ_n.

Let us now show that interpretation (2), finding the d-dimensional projection of x_1, ..., x_n that preserves the most variance, also arrives at the optimization problem (13).

1.1.2 PCA as d-dimensional projection that preserves the most variance

We aim to find an orthonormal basis v_1, ..., v_d (organized as V = [v_1, ..., v_d] with V^T V = I_{d×d}) of a d-dimensional space such that the projection of x_1, ..., x_n on this subspace has the most variance. Equivalently, we can ask for the points

    { ( v_1^T x_k, ..., v_d^T x_k ) }_{k=1}^n

to have as much variance as possible. Hence, we are interested in solving

    max_{V : V^T V = I} ∑_{k=1}^n || V^T x_k − (1/n) ∑_{r=1}^n V^T x_r ||^2.    (14)

Note that

    ∑_{k=1}^n || V^T x_k − (1/n) ∑_{r=1}^n V^T x_r ||^2 = ∑_{k=1}^n || V^T (x_k − μ_n) ||^2 = (n − 1) Tr( V^T Σ_n V ),

showing that (14) is equivalent to (13), and that the two interpretations of PCA are indeed equivalent.

1.1.3 Finding the Principal Components

When given a dataset x_1, ..., x_n ∈ R^p, in order to compute the Principal Components one needs to find the leading eigenvectors of

    Σ_n = (1/(n−1)) ∑_{k=1}^n (x_k − μ_n)(x_k − μ_n)^T.

A naive way of doing this would be to construct Σ_n (which takes O(n p^2) work) and then find its spectral decomposition (which takes O(p^3) work). This means that the computational complexity of this procedure is O(max{n p^2, p^3}) (see [HJ85] and/or [Gol96]).

An alternative is to use the Singular Value Decomposition (SVD). Let X = [x_1 ⋯ x_n] and recall that

    Σ_n = (1/(n−1)) ( X − μ_n 1^T )( X − μ_n 1^T )^T.

Let us take the SVD of X − μ_n 1^T = U_L D U_R^T, with U_L ∈ O(p), D diagonal, and U_R^T U_R = I. Then,

    Σ_n = (1/(n−1)) ( X − μ_n 1^T )( X − μ_n 1^T )^T = (1/(n−1)) U_L D U_R^T U_R D U_L^T = (1/(n−1)) U_L D^2 U_L^T,

meaning that the columns of U_L correspond to the eigenvectors of Σ_n. Computing the SVD of X − μ_n 1^T takes O(min{n^2 p, n p^2}) work, but if one is interested in simply computing the top d eigenvectors then this computational cost reduces to O(d n p). This can be further improved with randomized algorithms: there are randomized algorithms that compute an approximate solution in O(p n log d + (p + n) d^2) time (see for example [HMT09, RST09, MM15]).[1]

[1] If there is time, we might discuss some of these methods later in the course.

1.1.4 Which d should we pick?

Given a dataset, if the objective is to visualize it then picking d = 2 or d = 3 might make the most sense. However, PCA is useful for many other purposes. For example: (1) often the data belongs to a lower dimensional space but is corrupted by high dimensional noise; when using PCA it is oftentimes possible to reduce the noise while keeping the signal. (2) One may be interested in running an algorithm that would be too computationally expensive to run in high dimensions,

and dimension reduction may help there, etc. In these applications (and many others) it is not clear how to pick d.

If we denote the k-th largest eigenvalue of Σ_n as λ_k(Σ_n), then the k-th principal component accounts for a λ_k(Σ_n) / Tr(Σ_n) proportion of the variance of the data.[2]

A fairly popular heuristic is to try to choose the cut-off at a component that has significantly more variance than the one immediately after. This is usually visualized by a scree plot: a plot of the values of the ordered eigenvalues.

[Figure: an example of a scree plot, the ordered eigenvalues λ_k plotted against k.]

It is common to then try to identify an "elbow" on the scree plot to choose the cut-off. In the next section we will look into random matrix theory to try to better understand the behavior of the eigenvalues of Σ_n, and it will help us understand when to cut off.

1.1.5 A related open problem

We now describe an interesting open problem posed by Mallat and Zeitouni in [MZ].

Open Problem 1.1 (Mallat and Zeitouni [MZ]) Let g ~ N(0, Σ) be a gaussian random vector in R^p with a known covariance matrix Σ, and let d < p. Now, for any orthonormal basis V = [v_1, ..., v_p] of R^p, consider the following random variable Γ_V: given a draw of the random vector g, Γ_V is the squared ℓ2 norm of the largest projection of g on a subspace generated by d elements of the basis V. The question is: For which basis V is E[Γ_V] maximized?

[2] Note that Tr(Σ_n) = ∑_{k=1}^p λ_k(Σ_n).
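A quick Monte Carlo experiment makes the random variable Γ_V concrete. The sketch below (assuming numpy; the diagonal Σ, the sizes, and the comparison basis are arbitrary choices, not from the problem statement) estimates E[Γ_V] for the eigenbasis of a diagonal Σ and for a random orthonormal basis:

```python
import numpy as np

rng = np.random.default_rng(1)
p, d, trials = 4, 2, 50000
Sigma = np.diag([16.0, 4.0, 1.0, 0.25])   # WLOG diagonal, so the eigenbasis is canonical

def expected_gamma(V):
    """Monte Carlo estimate of E[Gamma_V]: squared norm of the largest
    projection of g ~ N(0, Sigma) onto a span of d elements of the basis V."""
    G = rng.multivariate_normal(np.zeros(p), Sigma, size=trials)
    proj2 = (G @ V) ** 2                   # squared coefficients along each basis vector
    # the best d-element subset collects the d largest squared coefficients
    top_d = np.sort(proj2, axis=1)[:, -d:]
    return top_d.sum(axis=1).mean()

eigenbasis = np.eye(p)
random_basis, _ = np.linalg.qr(rng.normal(size=(p, p)))

score_eigen = expected_gamma(eigenbasis)
score_random = expected_gamma(random_basis)
print(score_eigen, score_random)
```

In this example the eigenbasis scores higher than the random basis, in line with the conjecture discussed next; of course a simulation on one instance proves nothing about the general problem.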

The conjecture in [MZ] is that the optimal basis is the eigendecomposition of Σ. It is known that this is the case for d = 1 (see [MZ]), but the question remains open for d > 1. It is not very difficult to see that one can assume, without loss of generality, that Σ is diagonal.

A particularly intuitive way of stating the problem is:

1. Given Σ ∈ R^{p×p} and d.
2. Pick an orthonormal basis v_1, ..., v_p.
3. Given g ~ N(0, Σ).
4. Pick d elements ṽ_1, ..., ṽ_d of the basis.
5. Score: ∑_{i=1}^d ( ṽ_i^T g )^2.

The objective is to pick the basis in order to maximize the expected value of the Score.

Notice that if the steps of the procedure were taken in a slightly different order, in which step 4 would take place before having access to the draw of g (step 3), then the best basis is indeed the eigenbasis of Σ and the best subset of the basis is simply the d leading eigenvectors (notice the resemblance with PCA, as described above).

More formally, we can write the problem as finding

    argmax_{V ∈ R^{p×p} : V^T V = I}  E[ max_{S ⊆ [p], |S| = d} ∑_{i ∈ S} ( v_i^T g )^2 ],

where g ~ N(0, Σ). The observation regarding the different ordering of the steps amounts to saying that the eigenbasis of Σ is the optimal solution for

    argmax_{V ∈ R^{p×p} : V^T V = I}  max_{S ⊆ [p], |S| = d}  E[ ∑_{i ∈ S} ( v_i^T g )^2 ].

1.2 PCA in high dimensions and Marchenko-Pastur

Let us assume that the data points x_1, ..., x_n ∈ R^p are independent draws of a gaussian random variable g ~ N(0, Σ) for some covariance Σ ∈ R^{p×p}. In this case, when we use PCA we are hoping to find low dimensional structure in the distribution, which should correspond to large eigenvalues of Σ (and their corresponding eigenvectors). For this reason (and since PCA depends on the spectral properties of Σ_n) we would like to understand whether the spectral properties of Σ_n (eigenvalues and eigenvectors) are close to those of Σ.

Since E[Σ_n] = Σ, if p is fixed and n → ∞ the law of large numbers guarantees that indeed Σ_n → Σ. However, in many modern applications it is not uncommon to have p on the order of n (or, sometimes, even larger!).
For example, if our dataset is composed of images, then n is the number of images and p the number of pixels per image; it is conceivable that the number of pixels is on the order of the number of images in a set. Unfortunately, in that case, it is no longer clear that Σ_n ≈ Σ. Dealing with this type of difficulty is the realm of high dimensional statistics.
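This contrast is easy to see numerically. A small sketch (assuming numpy; Σ = I and the particular sizes are arbitrary choices): for fixed p the sample covariance converges as n grows, but when p is comparable to n the estimation error does not vanish:

```python
import numpy as np

rng = np.random.default_rng(0)

def cov_error(n, p):
    """Operator-norm error ||Sigma_n - I|| for n draws of N(0, I_p)."""
    X = rng.normal(size=(n, p))
    mu = X.mean(axis=0)
    Sigma_n = (X - mu).T @ (X - mu) / (n - 1)
    return np.linalg.norm(Sigma_n - np.eye(p), ord=2)

# Fixed p, growing n: the error shrinks (law of large numbers).
err_small_n = cov_error(100, 10)
err_large_n = cov_error(10000, 10)

# p proportional to n: the error stays of constant order.
err_high_dim = cov_error(1000, 500)

print(err_small_n, err_large_n, err_high_dim)
```

In the last case the eigenvalues of Σ_n spread far away from 1 even though the true covariance is the identity — exactly the phenomenon quantified next.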

For simplicity we will instead try to understand the spectral properties of

    S_n = (1/n) X X^T.

Since x ~ N(0, Σ), we know that μ_n ≈ 0 (and, clearly, n/(n−1) ≈ 1), so the spectral properties of S_n will be essentially the same as those of Σ_n.[3]

[3] In this case, S_n is actually the Maximum Likelihood Estimator for Σ; we'll talk about Maximum Likelihood Estimation later in the course.

Let us start by looking into a simple example, Σ = I. In that case the distribution has no low dimensional structure, as the distribution is rotation invariant. The following is a histogram (left) and a scree plot (right) of the eigenvalues of a sample of S_n (when Σ = I) for p = 500 and n = 1000. The red line is the eigenvalue distribution predicted by the Marchenko-Pastur distribution (15), which we will discuss below.

[Figure: histogram (left) and scree plot (right) of the eigenvalues of a sample of S_n for Σ = I, p = 500, n = 1000, together with the Marchenko-Pastur density in red.]

As one can see in the image, there are many eigenvalues considerably larger than 1 (and some considerably larger than others). Notice that, given this profile of eigenvalues of Σ_n, one could potentially be led to believe that the data has low dimensional structure, when in truth the distribution it was drawn from is isotropic.

Understanding the distribution of eigenvalues of random matrices is at the core of Random Matrix Theory (there are many good books on Random Matrix Theory, e.g. [Tao12] and [AGZ10]). This particular limiting distribution was first established in 1967 by Marchenko and Pastur [MP67] and is now referred to as the Marchenko-Pastur distribution. They showed that, if p and n are both going to ∞ with their ratio fixed, p/n = γ ≤ 1, the sample distribution of the eigenvalues of S_n (like the histogram above) will, in the limit, be

    dF_γ(λ) = (1 / (2π γ λ)) √( (γ_+ − λ)(λ − γ_−) ) 1_{[γ_−, γ_+]}(λ) dλ,    (15)

where γ_± = (1 ± √γ)^2,

with support [γ_−, γ_+]. This is plotted as the red line in the figure above.

Remark 1.2 We will not show the proof of the Marchenko-Pastur Theorem here (you can see, for example, [Bai99] for several different proofs of it), but an approach to a proof uses the so-called moment method. The core of the idea is to note that one can compute moments of the eigenvalue distribution in two ways, and note that (in the limit), for any k,

    E[ (1/p) Tr( ((1/n) X X^T)^k ) ] = E[ (1/p) Tr( S_n^k ) ] = E[ (1/p) ∑_{i=1}^p λ_i^k(S_n) ] = ∫_{γ_−}^{γ_+} λ^k dF_γ(λ),

and that the quantities E[ (1/p) Tr( ((1/n) X X^T)^k ) ] can be estimated (these estimates rely essentially on combinatorics). The distribution dF_γ(λ) can then be recovered from its moments.

1.2.1 A related open problem

Open Problem 1.2 (Monotonicity of singular values [BKS13a]) Consider the setting above but with p = n; then X ∈ R^{n×n} is a matrix with iid N(0,1) entries. Let σ_i( (1/√n) X ) denote the i-th singular value[4] of (1/√n) X, and define

    α_R(n) := E[ (1/n) ∑_{i=1}^n σ_i( (1/√n) X ) ],

the expected value of the average singular value of (1/√n) X. The conjecture is that, for every n ≥ 1,

    α_R(n+1) ≥ α_R(n).

Moreover, for the analogous quantity α_C(n) defined over the complex numbers, meaning simply that each entry of X is an iid complex valued standard gaussian CN(0,1), the reverse inequality is conjectured for all n ≥ 1:

    α_C(n+1) ≤ α_C(n).

Notice that the singular values of (1/√n) X are simply the square roots of the eigenvalues of S_n,

    σ_i( (1/√n) X ) = √( λ_i(S_n) ).

[4] The i-th diagonal element of Σ in the SVD (1/√n) X = U Σ V^T.
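This identity makes α_R(n) easy to estimate by simulation. A sketch (assuming numpy; the sample sizes are arbitrary choices), which can be compared against the two exact values computed next, α_R(1) = √(2/π) ≈ 0.7979 and lim α_R(n) = 8/(3π) ≈ 0.8488:

```python
import numpy as np

rng = np.random.default_rng(0)

# alpha_R(1) is just E|g| for a standard gaussian g.
alpha_1 = np.abs(rng.normal(size=200000)).mean()

# For larger n, the average singular value of X/sqrt(n) concentrates
# around its expectation, so a single sample already gives a good estimate.
n = 400
X = rng.normal(size=(n, n))
alpha_n = np.linalg.svd(X / np.sqrt(n), compute_uv=False).mean()

print(alpha_1, alpha_n)
```

With these parameters the estimates land close to the two exact values, and the estimate for n = 400 exceeds the one for n = 1, consistent with the conjectured monotonicity.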

This means that we can compute α_R in the limit (since we know the limiting distribution of the λ_i(S_n)) and get (since p = n we have γ = 1, γ_− = 0, and γ_+ = 4)

    lim_{n→∞} α_R(n) = ∫_0^4 √λ dF_1(λ) = ∫_0^4 √λ (1/(2πλ)) √( (4−λ) λ ) dλ = 8/(3π) ≈ 0.8488.

Also, α_R(1) simply corresponds to the expected value of the absolute value of a standard gaussian g,

    α_R(1) = E|g| = √(2/π) ≈ 0.7979,

which is compatible with the conjecture. On the complex valued side, the Marchenko-Pastur distribution also holds, and so lim_{n→∞} α_C(n) = lim_{n→∞} α_R(n); α_C(1) can also be easily calculated and seen to be larger than the limit.

1.3 Spike Models and BBP transition

What if there actually is some (linear) low dimensional structure in the data? When can we expect to capture it with PCA? A particularly simple, yet relevant, example to analyse is when the covariance matrix Σ is an identity with a rank 1 perturbation, which we refer to as a spike model,

    Σ = I + β v v^T,

for v a unit norm vector and β ≥ 0.

One way to think about this instance is as each data point x consisting of a signal part √β g_0 v, where g_0 is a one-dimensional standard gaussian (so that √β g_0 v is a gaussian multiple of the fixed vector v), and a noise part g ~ N(0, I) (independent of g_0). Then x = g + √β g_0 v is a gaussian random variable,

    x ~ N(0, I + β v v^T).

A natural question is whether this rank 1 perturbation can be seen in S_n. Let us build some intuition with an example. The following is the histogram of the eigenvalues of a sample of S_n for p = 500, n = 1000, v equal to the first element of the canonical basis (v = e_1), and β = 1.5:

[Figure: histogram of the eigenvalues of a sample of S_n for the spike model with p = 500, n = 1000, β = 1.5; a red x marks the location estimate derived below.]
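This experiment is easy to reproduce. A minimal sketch (assuming numpy), using the same parameters, together with a run below the transition at β = 0.5; the predicted location (1 + β)(1 + γ/β) for the outlying eigenvalue is derived later in this section:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 500, 1000
gamma = p / n                        # aspect ratio, gamma = 1/2
edge = (1 + np.sqrt(gamma)) ** 2     # right edge of Marchenko-Pastur, ~2.914

def top_eigenvalue(beta):
    """Largest eigenvalue of S_n = (1/n) X X^T when Sigma = I + beta e_1 e_1^T."""
    X = rng.normal(size=(p, n))
    X[0, :] *= np.sqrt(1 + beta)     # the spiked coordinate has variance 1 + beta
    return np.linalg.eigvalsh(X @ X.T / n)[-1]

# Above the critical value sqrt(gamma) ~ 0.707 an eigenvalue pops out,
# near the predicted location (1 + beta)(1 + gamma/beta).
beta = 1.5
prediction = (1 + beta) * (1 + gamma / beta)   # = 10/3 ~ 3.33
lam_above = top_eigenvalue(beta)

# Below the critical value the top eigenvalue sticks to the bulk edge.
lam_below = top_eigenvalue(0.5)

print(lam_above, prediction, lam_below, edge)
```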

The image suggests that there is an eigenvalue of S_n that pops out of the support of the Marchenko-Pastur distribution (below we will estimate the location of this eigenvalue, and that estimate corresponds to the red x). It is worth noticing that the largest eigenvalue of Σ is simply 1 + β = 2.5, while the largest eigenvalue of S_n appears considerably larger than that. Let us now try the same experiment for β = 0.5:

[Figure: histogram of the eigenvalues of a sample of S_n for the spike model with p = 500, n = 1000, β = 0.5.]

It appears that, for β = 0.5, the distribution of the eigenvalues is indistinguishable from the one obtained when Σ = I. This motivates the following question:

Question 1.3 For which values of γ and β do we expect to see an eigenvalue of S_n popping out of the support of the Marchenko-Pastur distribution, and what is the limit value that we expect it to take?

As we will see below, there is a critical value of β below which we don't expect to see a change in the distribution of eigenvalues and above which we expect one of the eigenvalues to pop out of the support. This is known as the BBP transition (after Baik, Ben Arous, and Péché [BBAP05]). There are many very nice papers about this and similar phenomena, including [Pau, Joh01, BBAP05, Pau07, BS05, Kar05, BGN11, BGN12].[5]

[5] Notice that the Marchenko-Pastur theorem does not imply that all eigenvalues are actually in the support of the Marchenko-Pastur distribution; it just rules out that a non-vanishing proportion are outside of it. However, it is possible to show that, indeed, in the limit all eigenvalues will be in the support (see, for example, [Pau]).

In what follows we will find the critical value of β and estimate the location of the largest eigenvalue of S_n. While the argument we will use can be made precise (and is borrowed from [Pau]), we will be ignoring a few details for the sake of exposition. In short, the argument below can be transformed into a rigorous proof, but it is not one in its present form!

First of all, it is not difficult to see that we can assume that v = e_1 (since everything else is rotation invariant). We want to understand the behavior of the leading eigenvalue of

    S_n = (1/n) ∑_{i=1}^n x_i x_i^T = (1/n) X X^T,

where X = [x_1, ..., x_n] ∈ R^{p×n}. We can write X as

    X = [ √(1+β) Z_1 ; Z_2 ],

where Z_1 ∈ R^{1×n} and Z_2 ∈ R^{(p−1)×n} are both populated with i.i.d. standard gaussian entries (N(0,1)). Then

    S_n = (1/n) X X^T = (1/n) [ (1+β) Z_1 Z_1^T ,  √(1+β) Z_1 Z_2^T ;  √(1+β) Z_2 Z_1^T ,  Z_2 Z_2^T ].

Now, let λ̂ and v = [v_1 ; v_2], where v_1 ∈ R and v_2 ∈ R^{p−1}, denote, respectively, an eigenvalue and an associated eigenvector of S_n. By the definition of eigenvalue and eigenvector we have

    (1/n) [ (1+β) Z_1 Z_1^T ,  √(1+β) Z_1 Z_2^T ;  √(1+β) Z_2 Z_1^T ,  Z_2 Z_2^T ] [ v_1 ; v_2 ] = λ̂ [ v_1 ; v_2 ],

which can be rewritten as

    (1/n) (1+β) Z_1 Z_1^T v_1 + (1/n) √(1+β) Z_1 Z_2^T v_2 = λ̂ v_1,    (16)
    (1/n) √(1+β) Z_2 Z_1^T v_1 + (1/n) Z_2 Z_2^T v_2 = λ̂ v_2.    (17)

(17) is equivalent to

    (1/n) √(1+β) Z_2 Z_1^T v_1 = ( λ̂ I − (1/n) Z_2 Z_2^T ) v_2.

If λ̂ I − (1/n) Z_2 Z_2^T is invertible (this won't be justified here, but it is in [Pau]), then we can rewrite this as

    v_2 = ( λ̂ I − (1/n) Z_2 Z_2^T )^{−1} (1/n) √(1+β) Z_2 Z_1^T v_1,

which we can then plug into (16) to get

    (1/n) (1+β) Z_1 Z_1^T v_1 + (1/n) √(1+β) Z_1 Z_2^T ( λ̂ I − (1/n) Z_2 Z_2^T )^{−1} (1/n) √(1+β) Z_2 Z_1^T v_1 = λ̂ v_1.

If v_1 ≠ 0 (again, not properly justified here, see [Pau]), then this means that

    λ̂ = (1/n) (1+β) Z_1 Z_1^T + (1+β) (1/n) Z_1 Z_2^T ( λ̂ I − (1/n) Z_2 Z_2^T )^{−1} (1/n) Z_2 Z_1^T.    (18)

A first observation is that, because Z_1 ∈ R^{1×n} has standard gaussian entries, (1/n) Z_1 Z_1^T → 1, meaning that

    λ̂ ≈ (1+β) [ 1 + (1/n) Z_1 Z_2^T ( λ̂ I − (1/n) Z_2 Z_2^T )^{−1} (1/n) Z_2 Z_1^T ].    (19)

Consider the SVD of Z_2: Z_2 = U Σ V^T, where U ∈ R^{(p−1)×(p−1)} and V ∈ R^{n×(p−1)} have orthonormal columns (meaning that U^T U = I and V^T V = I), and Σ is a diagonal matrix. Take D = (1/n) Σ^2; then

    (1/n) Z_2 Z_2^T = (1/n) U Σ^2 U^T = U D U^T,

meaning that the diagonal entries of D correspond to the eigenvalues of (1/n) Z_2 Z_2^T, which we expect to be distributed (in the limit) according to the Marchenko-Pastur distribution for (p−1)/n ≈ γ. Replacing back in (19),

    λ̂ = (1+β) [ 1 + (1/n) Z_1 ( V Σ U^T ) ( λ̂ I − U D U^T )^{−1} (1/n) ( U Σ V^T ) Z_1^T ]
       = (1+β) [ 1 + (1/n) ( Z_1 V ) D^{1/2} U^T U ( λ̂ I − D )^{−1} U^T U D^{1/2} ( V^T Z_1^T ) ]
       = (1+β) [ 1 + (1/n) ( Z_1 V ) D^{1/2} ( λ̂ I − D )^{−1} D^{1/2} ( V^T Z_1^T ) ].

Since the columns of V are orthonormal, g := V^T Z_1^T ∈ R^{p−1} is an isotropic gaussian (g ~ N(0, I)); in fact,

    E[ g g^T ] = E[ V^T Z_1^T Z_1 V ] = V^T E[ Z_1^T Z_1 ] V = V^T V = I_{(p−1)×(p−1)}.

We proceed:

    λ̂ = (1+β) [ 1 + (1/n) g^T D^{1/2} ( λ̂ I − D )^{−1} D^{1/2} g ] = (1+β) [ 1 + (1/n) ∑_{j=1}^{p−1} g_j^2 D_jj / ( λ̂ − D_jj ) ].

Because we expect the diagonal entries of D to be distributed according to the Marchenko-Pastur distribution, and g to be independent of them, we expect that (again, not properly justified here, see [Pau])

    (1/n) ∑_{j=1}^{p−1} g_j^2 D_jj / ( λ̂ − D_jj ) ≈ (p/n) (1/p) ∑_{j=1}^{p−1} D_jj / ( λ̂ − D_jj ) ≈ γ ∫_{γ_−}^{γ_+} x / ( λ̂ − x ) dF_γ(x).

We thus get an equation for λ̂,

    λ̂ = (1+β) [ 1 + γ ∫_{γ_−}^{γ_+} x / ( λ̂ − x ) dF_γ(x) ],

which can be easily solved with the help of a program that computes integrals symbolically (such as Mathematica) to give (you can also see [Pau] for a derivation)

    λ̂ = (1+β) ( 1 + γ/β ),    (20)

which is particularly elegant (especially considering the size of some of the equations used in the derivation).

An important thing to notice is that for β = √γ we have

    λ̂ = (1 + √γ) ( 1 + γ/√γ ) = (1 + √γ)^2 = γ_+,

suggesting that β = √γ is the critical point. Indeed this is the case, and it is possible to make the above argument rigorous[6] and show that, in the model described above,

    if β ≤ √γ, then λ_max(S_n) → γ_+,

and

    if β > √γ, then λ_max(S_n) → (1+β)(1 + γ/β) > γ_+.

Another important question is whether the leading eigenvector actually correlates with the planted perturbation (in this case e_1). It turns out that very similar techniques can answer this question as well [Pau], and show that the leading eigenvector v_max of S_n will be non-trivially correlated with e_1 if and only if β > √γ. More precisely,

    if β ≤ √γ, then |⟨v_max, e_1⟩|^2 → 0,

and

    if β > √γ, then |⟨v_max, e_1⟩|^2 → (1 − γ/β^2) / (1 + γ/β).

[6] Note that in the argument above it wasn't even completely clear where it was used that the eigenvalue in question was actually the leading one. In the actual proof one first needs to make sure that there is an eigenvalue outside the support, and the proof only holds for that one; see [Pau].

1.3.1 A brief mention of Wigner matrices

Another very important random matrix model is the Wigner matrix (it will show up later in this course). Given an integer n, a standard gaussian Wigner matrix W ∈ R^{n×n} is a symmetric matrix with independent N(0,1) entries (except for the symmetry constraint W_ij = W_ji). In the limit, the eigenvalues of (1/√n) W are distributed according to the so-called semi-circular law,

    dSC(x) = (1/(2π)) √(4 − x^2) 1_{[−2,2]}(x) dx,

and there is also a BBP-like transition for this matrix ensemble [FP06]. More precisely, if v is a unit-norm vector in R^n and ξ ≥ 0, then the largest eigenvalue of (1/√n) W + ξ v v^T satisfies:

If ξ ≤ 1, then

    λ_max( (1/√n) W + ξ v v^T ) → 2,

and if ξ > 1, then

    λ_max( (1/√n) W + ξ v v^T ) → ξ + 1/ξ.

1.3.2 An open problem about spike models

Open Problem 1.3 (Spike Model for cut SDP [MS15]; it has since been solved [MS15]) Let W denote a symmetric Wigner matrix with i.i.d. entries W_ij ~ N(0,1). Also, given B ∈ R^{n×n} symmetric, define

    Q(B) = max { Tr(B X) : X ⪰ 0, X_ii = 1 for all i }.

Define q(ξ) as

    q(ξ) = lim_{n→∞} (1/n) E[ Q( ξ (1/n) 1 1^T + (1/√n) W ) ].

What is the value of ξ*, defined as

    ξ* = inf { ξ ≥ 0 : q(ξ) > 2 }?

It is known that, if 0 ≤ ξ ≤ 1, then q(ξ) = 2 [MS15]. One can show that Q(B) ≤ n λ_max(B). In fact,

    max { Tr(B X) : X ⪰ 0, X_ii = 1 } ≤ max { Tr(B X) : X ⪰ 0, Tr(X) = n }.

It is also not difficult to show (hint: take the spectral decomposition of X) that

    max { Tr(B X) : X ⪰ 0, Tr(X) = n } = n λ_max(B).

This means that, for ξ > 1, q(ξ) ≤ ξ + 1/ξ.

Remark 1.4 Optimization problems of the type max { Tr(B X) : X ⪰ 0, X_ii = 1 } are semidefinite programs; they will be a major player later in the course!

Since (1/n) E[ Tr( ( ξ (1/n) 1 1^T + (1/√n) W ) 1 1^T ) ] = ξ, by taking X = 1 1^T we expect that q(ξ) ≥ ξ. These observations imply that 1 ≤ ξ* ≤ 2 (see [MS15]). A reasonable conjecture is that it is equal to 1. This would imply that a certain semidefinite programming based algorithm for clustering under the Stochastic Block Model on 2 clusters (we will discuss these things later in the course) is optimal for detection (see [MS15]).[7]

Remark 1.5 We remark that Open Problem 1.3 has since been solved [MS15].

[7] Later in the course we will discuss clustering under the Stochastic Block Model quite thoroughly, and will see how this same SDP is known to be optimal for exact recovery [ABH14, HWX14, Ban15c].
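The two bounds around Q(B) can be checked on a tiny instance by brute force. The lower bound below is a standard observation not spelled out above: X = x x^T for any sign vector x ∈ {−1, 1}^n is feasible (it is PSD with unit diagonal), so Q(B) ≥ max_x x^T B x. A sketch, assuming numpy; the small n and the random B are arbitrary choices:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
n = 8
A = rng.normal(size=(n, n))
B = (A + A.T) / np.sqrt(2)           # a small symmetric "Wigner-like" matrix

# Upper bound from the text: Q(B) <= n * lambda_max(B).
upper = n * np.linalg.eigvalsh(B)[-1]

# Lower bound: X = x x^T is feasible for every sign vector x, so
# Q(B) >= max over all 2^n sign vectors of x^T B x.
lower = max(np.array(s) @ B @ np.array(s)
            for s in product([-1.0, 1.0], repeat=n))

print(lower, upper)   # lower <= Q(B) <= upper
```

For n = 8 the enumeration is only 2^8 = 256 sign vectors; solving the SDP itself (to pin Q(B) between the two bounds) would require a semidefinite programming solver, which is not assumed here.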

MIT OpenCourseWare
18.S096 Topics in Mathematics of Data Science
Fall 2015

For information about citing these materials or our Terms of Use, visit:

18.S096: Principal Component Analysis in High Dimensions and the Spike Model
Topics in Mathematics of Data Science (Fall 2015)
Afonso S. Bandeira — bandeira@mit.edu — http://math.mit.edu/~bandeira
September 18, 2015

More information

THE KALMAN FILTER RAUL ROJAS

THE KALMAN FILTER RAUL ROJAS THE KALMAN FILTER RAUL ROJAS Abstract. This paper provides a getle itroductio to the Kalma filter, a umerical method that ca be used for sesor fusio or for calculatio of trajectories. First, we cosider

More information

LINEAR ALGEBRA. Paul Dawkins

LINEAR ALGEBRA. Paul Dawkins LINEAR ALGEBRA Paul Dawkis Table of Cotets Preface... ii Outlie... iii Systems of Equatios ad Matrices... Itroductio... Systems of Equatios... Solvig Systems of Equatios... 5 Matrices... 7 Matrix Arithmetic

More information

Section 4.3. Boolean functions

Section 4.3. Boolean functions Sectio 4.3. Boolea fuctios Let us take aother look at the simplest o-trivial Boolea algebra, ({0}), the power-set algebra based o a oe-elemet set, chose here as {0}. This has two elemets, the empty set,

More information

The Growth of Functions. Theoretical Supplement

The Growth of Functions. Theoretical Supplement The Growth of Fuctios Theoretical Supplemet The Triagle Iequality The triagle iequality is a algebraic tool that is ofte useful i maipulatig absolute values of fuctios. The triagle iequality says that

More information

Notes for Lecture 11

Notes for Lecture 11 U.C. Berkeley CS78: Computatioal Complexity Hadout N Professor Luca Trevisa 3/4/008 Notes for Lecture Eigevalues, Expasio, ad Radom Walks As usual by ow, let G = (V, E) be a udirected d-regular graph with

More information

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1

EECS564 Estimation, Filtering, and Detection Hwk 2 Solns. Winter p θ (z) = (2θz + 1 θ), 0 z 1 EECS564 Estimatio, Filterig, ad Detectio Hwk 2 Sols. Witer 25 4. Let Z be a sigle observatio havig desity fuctio where. p (z) = (2z + ), z (a) Assumig that is a oradom parameter, fid ad plot the maximum

More information

Optimization Methods MIT 2.098/6.255/ Final exam

Optimization Methods MIT 2.098/6.255/ Final exam Optimizatio Methods MIT 2.098/6.255/15.093 Fial exam Date Give: December 19th, 2006 P1. [30 pts] Classify the followig statemets as true or false. All aswers must be well-justified, either through a short

More information

Factor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis

Factor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis Lecture 10: Factor Aalysis ad Pricipal Compoet Aalysis Sam Roweis February 9, 2004 Whe we assume that the subspace is liear ad that the uderlyig latet variable has a Gaussia distributio we get a model

More information

MATH10212 Linear Algebra B Proof Problems

MATH10212 Linear Algebra B Proof Problems MATH22 Liear Algebra Proof Problems 5 Jue 26 Each problem requests a proof of a simple statemet Problems placed lower i the list may use the results of previous oes Matrices ermiats If a b R the matrix

More information

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS

THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS THE ASYMPTOTIC COMPLEXITY OF MATRIX REDUCTION OVER FINITE FIELDS DEMETRES CHRISTOFIDES Abstract. Cosider a ivertible matrix over some field. The Gauss-Jorda elimiatio reduces this matrix to the idetity

More information

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample.

Statistical Inference (Chapter 10) Statistical inference = learn about a population based on the information provided by a sample. Statistical Iferece (Chapter 10) Statistical iferece = lear about a populatio based o the iformatio provided by a sample. Populatio: The set of all values of a radom variable X of iterest. Characterized

More information

if j is a neighbor of i,

if j is a neighbor of i, We see that if i = j the the coditio is trivially satisfied. Otherwise, T ij (i) = (i)q ij mi 1, (j)q ji, ad, (i)q ij T ji (j) = (j)q ji mi 1, (i)q ij. (j)q ji Now there are two cases, if (j)q ji (i)q

More information

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics

ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER / Statistics ACCESS TO SCIENCE, ENGINEERING AND AGRICULTURE: MATHEMATICS 1 MATH00030 SEMESTER 1 018/019 DR. ANTHONY BROWN 8. Statistics 8.1. Measures of Cetre: Mea, Media ad Mode. If we have a series of umbers the

More information

Seunghee Ye Ma 8: Week 5 Oct 28

Seunghee Ye Ma 8: Week 5 Oct 28 Week 5 Summary I Sectio, we go over the Mea Value Theorem ad its applicatios. I Sectio 2, we will recap what we have covered so far this term. Topics Page Mea Value Theorem. Applicatios of the Mea Value

More information

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n. Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator

More information

Frequentist Inference

Frequentist Inference Frequetist Iferece The topics of the ext three sectios are useful applicatios of the Cetral Limit Theorem. Without kowig aythig about the uderlyig distributio of a sequece of radom variables {X i }, for

More information

Understanding Samples

Understanding Samples 1 Will Moroe CS 109 Samplig ad Bootstrappig Lecture Notes #17 August 2, 2017 Based o a hadout by Chris Piech I this chapter we are goig to talk about statistics calculated o samples from a populatio. We

More information

Math 155 (Lecture 3)

Math 155 (Lecture 3) Math 55 (Lecture 3) September 8, I this lecture, we ll cosider the aswer to oe of the most basic coutig problems i combiatorics Questio How may ways are there to choose a -elemet subset of the set {,,,

More information

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4

MATH 320: Probability and Statistics 9. Estimation and Testing of Parameters. Readings: Pruim, Chapter 4 MATH 30: Probability ad Statistics 9. Estimatio ad Testig of Parameters Estimatio ad Testig of Parameters We have bee dealig situatios i which we have full kowledge of the distributio of a radom variable.

More information

Chapter 6 Principles of Data Reduction

Chapter 6 Principles of Data Reduction Chapter 6 for BST 695: Special Topics i Statistical Theory. Kui Zhag, 0 Chapter 6 Priciples of Data Reductio Sectio 6. Itroductio Goal: To summarize or reduce the data X, X,, X to get iformatio about a

More information

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES.

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ANDREW SALCH 1. The Jacobia criterio for osigularity. You have probably oticed by ow that some poits o varieties are smooth i a sese somethig

More information

1 Inferential Methods for Correlation and Regression Analysis

1 Inferential Methods for Correlation and Regression Analysis 1 Iferetial Methods for Correlatio ad Regressio Aalysis I the chapter o Correlatio ad Regressio Aalysis tools for describig bivariate cotiuous data were itroduced. The sample Pearso Correlatio Coefficiet

More information

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator

Economics 241B Relation to Method of Moments and Maximum Likelihood OLSE as a Maximum Likelihood Estimator Ecoomics 24B Relatio to Method of Momets ad Maximum Likelihood OLSE as a Maximum Likelihood Estimator Uder Assumptio 5 we have speci ed the distributio of the error, so we ca estimate the model parameters

More information

Symmetric Matrices and Quadratic Forms

Symmetric Matrices and Quadratic Forms 7 Symmetric Matrices ad Quadratic Forms 7.1 DIAGONALIZAION OF SYMMERIC MARICES SYMMERIC MARIX A symmetric matrix is a matrix A such that. A = A Such a matrix is ecessarily square. Its mai diagoal etries

More information

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman:

Problems from 9th edition of Probability and Statistical Inference by Hogg, Tanis and Zimmerman: Math 224 Fall 2017 Homework 4 Drew Armstrog Problems from 9th editio of Probability ad Statistical Iferece by Hogg, Tais ad Zimmerma: Sectio 2.3, Exercises 16(a,d),18. Sectio 2.4, Exercises 13, 14. Sectio

More information

(3) If you replace row i of A by its sum with a multiple of another row, then the determinant is unchanged! Expand across the i th row:

(3) If you replace row i of A by its sum with a multiple of another row, then the determinant is unchanged! Expand across the i th row: Math 5-4 Tue Feb 4 Cotiue with sectio 36 Determiats The effective way to compute determiats for larger-sized matrices without lots of zeroes is to ot use the defiitio, but rather to use the followig facts,

More information

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity

LINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 9 Multicolliearity Dr Shalabh Departmet of Mathematics ad Statistics Idia Istitute of Techology Kapur Multicolliearity diagostics A importat questio that

More information

Advanced Stochastic Processes.

Advanced Stochastic Processes. Advaced Stochastic Processes. David Gamarik LECTURE 2 Radom variables ad measurable fuctios. Strog Law of Large Numbers (SLLN). Scary stuff cotiued... Outlie of Lecture Radom variables ad measurable fuctios.

More information

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence

Chapter 3. Strong convergence. 3.1 Definition of almost sure convergence Chapter 3 Strog covergece As poited out i the Chapter 2, there are multiple ways to defie the otio of covergece of a sequece of radom variables. That chapter defied covergece i probability, covergece i

More information

Sequences. Notation. Convergence of a Sequence

Sequences. Notation. Convergence of a Sequence Sequeces A sequece is essetially just a list. Defiitio (Sequece of Real Numbers). A sequece of real umbers is a fuctio Z (, ) R for some real umber. Do t let the descriptio of the domai cofuse you; it

More information

n outcome is (+1,+1, 1,..., 1). Let the r.v. X denote our position (relative to our starting point 0) after n moves. Thus X = X 1 + X 2 + +X n,

n outcome is (+1,+1, 1,..., 1). Let the r.v. X denote our position (relative to our starting point 0) after n moves. Thus X = X 1 + X 2 + +X n, CS 70 Discrete Mathematics for CS Sprig 2008 David Wager Note 9 Variace Questio: At each time step, I flip a fair coi. If it comes up Heads, I walk oe step to the right; if it comes up Tails, I walk oe

More information

MA131 - Analysis 1. Workbook 3 Sequences II

MA131 - Analysis 1. Workbook 3 Sequences II MA3 - Aalysis Workbook 3 Sequeces II Autum 2004 Cotets 2.8 Coverget Sequeces........................ 2.9 Algebra of Limits......................... 2 2.0 Further Useful Results........................

More information

ECON 3150/4150, Spring term Lecture 3

ECON 3150/4150, Spring term Lecture 3 Itroductio Fidig the best fit by regressio Residuals ad R-sq Regressio ad causality Summary ad ext step ECON 3150/4150, Sprig term 2014. Lecture 3 Ragar Nymoe Uiversity of Oslo 21 Jauary 2014 1 / 30 Itroductio

More information

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients.

Definitions and Theorems. where x are the decision variables. c, b, and a are constant coefficients. Defiitios ad Theorems Remember the scalar form of the liear programmig problem, Miimize, Subject to, f(x) = c i x i a 1i x i = b 1 a mi x i = b m x i 0 i = 1,2,, where x are the decisio variables. c, b,

More information

Homework Set #3 - Solutions

Homework Set #3 - Solutions EE 15 - Applicatios of Covex Optimizatio i Sigal Processig ad Commuicatios Dr. Adre Tkaceko JPL Third Term 11-1 Homework Set #3 - Solutios 1. a) Note that x is closer to x tha to x l i the Euclidea orm

More information

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled 1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how

More information

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015

ECE 8527: Introduction to Machine Learning and Pattern Recognition Midterm # 1. Vaishali Amin Fall, 2015 ECE 8527: Itroductio to Machie Learig ad Patter Recogitio Midterm # 1 Vaishali Ami Fall, 2015 tue39624@temple.edu Problem No. 1: Cosider a two-class discrete distributio problem: ω 1 :{[0,0], [2,0], [2,2],

More information

Bayesian Methods: Introduction to Multi-parameter Models

Bayesian Methods: Introduction to Multi-parameter Models Bayesia Methods: Itroductio to Multi-parameter Models Parameter: θ = ( θ, θ) Give Likelihood p(y θ) ad prior p(θ ), the posterior p proportioal to p(y θ) x p(θ ) Margial posterior ( θ, θ y) is Iterested

More information

Lecture 19. sup y 1,..., yn B d n

Lecture 19. sup y 1,..., yn B d n STAT 06A: Polyomials of adom Variables Lecture date: Nov Lecture 19 Grothedieck s Iequality Scribe: Be Hough The scribes are based o a guest lecture by ya O Doell. I this lecture we prove Grothedieck s

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

TMA4205 Numerical Linear Algebra. The Poisson problem in R 2 : diagonalization methods

TMA4205 Numerical Linear Algebra. The Poisson problem in R 2 : diagonalization methods TMA4205 Numerical Liear Algebra The Poisso problem i R 2 : diagoalizatio methods September 3, 2007 c Eiar M Røquist Departmet of Mathematical Scieces NTNU, N-749 Trodheim, Norway All rights reserved A

More information

State Space Representation

State Space Representation Optimal Cotrol, Guidace ad Estimatio Lecture 2 Overview of SS Approach ad Matrix heory Prof. Radhakat Padhi Dept. of Aerospace Egieerig Idia Istitute of Sciece - Bagalore State Space Represetatio Prof.

More information

THE SPECTRAL RADII AND NORMS OF LARGE DIMENSIONAL NON-CENTRAL RANDOM MATRICES

THE SPECTRAL RADII AND NORMS OF LARGE DIMENSIONAL NON-CENTRAL RANDOM MATRICES COMMUN. STATIST.-STOCHASTIC MODELS, 0(3), 525-532 (994) THE SPECTRAL RADII AND NORMS OF LARGE DIMENSIONAL NON-CENTRAL RANDOM MATRICES Jack W. Silverstei Departmet of Mathematics, Box 8205 North Carolia

More information

Estimation for Complete Data

Estimation for Complete Data Estimatio for Complete Data complete data: there is o loss of iformatio durig study. complete idividual complete data= grouped data A complete idividual data is the oe i which the complete iformatio of

More information

The Binomial Theorem

The Binomial Theorem The Biomial Theorem Robert Marti Itroductio The Biomial Theorem is used to expad biomials, that is, brackets cosistig of two distict terms The formula for the Biomial Theorem is as follows: (a + b ( k

More information

Probability, Expectation Value and Uncertainty

Probability, Expectation Value and Uncertainty Chapter 1 Probability, Expectatio Value ad Ucertaity We have see that the physically observable properties of a quatum system are represeted by Hermitea operators (also referred to as observables ) such

More information

The standard deviation of the mean

The standard deviation of the mean Physics 6C Fall 20 The stadard deviatio of the mea These otes provide some clarificatio o the distictio betwee the stadard deviatio ad the stadard deviatio of the mea.. The sample mea ad variace Cosider

More information

Lecture 20. Brief Review of Gram-Schmidt and Gauss s Algorithm

Lecture 20. Brief Review of Gram-Schmidt and Gauss s Algorithm 8.409 A Algorithmist s Toolkit Nov. 9, 2009 Lecturer: Joatha Keler Lecture 20 Brief Review of Gram-Schmidt ad Gauss s Algorithm Our mai task of this lecture is to show a polyomial time algorithm which

More information

Introduction to Machine Learning DIS10

Introduction to Machine Learning DIS10 CS 189 Fall 017 Itroductio to Machie Learig DIS10 1 Fu with Lagrage Multipliers (a) Miimize the fuctio such that f (x,y) = x + y x + y = 3. Solutio: The Lagragia is: L(x,y,λ) = x + y + λ(x + y 3) Takig

More information

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis Recursive Algorithms Recurreces Computer Sciece & Egieerig 35: Discrete Mathematics Christopher M Bourke cbourke@cseuledu A recursive algorithm is oe i which objects are defied i terms of other objects

More information

Lecture 1: Basic problems of coding theory

Lecture 1: Basic problems of coding theory Lecture 1: Basic problems of codig theory Error-Correctig Codes (Sprig 016) Rutgers Uiversity Swastik Kopparty Scribes: Abhishek Bhrushudi & Aditya Potukuchi Admiistrivia was discussed at the begiig of

More information

CS284A: Representations and Algorithms in Molecular Biology

CS284A: Representations and Algorithms in Molecular Biology CS284A: Represetatios ad Algorithms i Molecular Biology Scribe Notes o Lectures 3 & 4: Motif Discovery via Eumeratio & Motif Represetatio Usig Positio Weight Matrix Joshua Gervi Based o presetatios by

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

CHAPTER 5. Theory and Solution Using Matrix Techniques

CHAPTER 5. Theory and Solution Using Matrix Techniques A SERIES OF CLASS NOTES FOR 2005-2006 TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS DE CLASS NOTES 3 A COLLECTION OF HANDOUTS ON SYSTEMS OF ORDINARY DIFFERENTIAL

More information

Efficient GMM LECTURE 12 GMM II

Efficient GMM LECTURE 12 GMM II DECEMBER 1 010 LECTURE 1 II Efficiet The estimator depeds o the choice of the weight matrix A. The efficiet estimator is the oe that has the smallest asymptotic variace amog all estimators defied by differet

More information

Chapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc.

Chapter 22. Comparing Two Proportions. Copyright 2010, 2007, 2004 Pearson Education, Inc. Chapter 22 Comparig Two Proportios Copyright 2010, 2007, 2004 Pearso Educatio, Ic. Comparig Two Proportios Read the first two paragraphs of pg 504. Comparisos betwee two percetages are much more commo

More information

Axis Aligned Ellipsoid

Axis Aligned Ellipsoid Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple

More information

The Random Walk For Dummies

The Random Walk For Dummies The Radom Walk For Dummies Richard A Mote Abstract We look at the priciples goverig the oe-dimesioal discrete radom walk First we review five basic cocepts of probability theory The we cosider the Beroulli

More information

The Jordan Normal Form: A General Approach to Solving Homogeneous Linear Systems. Mike Raugh. March 20, 2005

The Jordan Normal Form: A General Approach to Solving Homogeneous Linear Systems. Mike Raugh. March 20, 2005 The Jorda Normal Form: A Geeral Approach to Solvig Homogeeous Liear Sstems Mike Raugh March 2, 25 What are we doig here? I this ote, we describe the Jorda ormal form of a matrix ad show how it ma be used

More information

Principle Of Superposition

Principle Of Superposition ecture 5: PREIMINRY CONCEP O RUCUR NYI Priciple Of uperpositio Mathematically, the priciple of superpositio is stated as ( a ) G( a ) G( ) G a a or for a liear structural system, the respose at a give

More information

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc.

Chapter 22. Comparing Two Proportions. Copyright 2010 Pearson Education, Inc. Chapter 22 Comparig Two Proportios Copyright 2010 Pearso Educatio, Ic. Comparig Two Proportios Comparisos betwee two percetages are much more commo tha questios about isolated percetages. Ad they are more

More information

Math 113 Exam 3 Practice

Math 113 Exam 3 Practice Math Exam Practice Exam will cover.-.9. This sheet has three sectios. The first sectio will remid you about techiques ad formulas that you should kow. The secod gives a umber of practice questios for you

More information

Mon Apr Second derivative test, and maybe another conic diagonalization example. Announcements: Warm-up Exercise:

Mon Apr Second derivative test, and maybe another conic diagonalization example. Announcements: Warm-up Exercise: Math 2270-004 Week 15 otes We will ot ecessarily iish the material rom a give day's otes o that day We may also add or subtract some material as the week progresses, but these otes represet a i-depth outlie

More information