Principal Components

Suppose we have $N$ measurements on each of $p$ variables $X_j$, $j = 1, \ldots, p$. There are several equivalent approaches to principal components:

• Given $X = (X_1, \ldots, X_p)$, produce a derived (and small) set of uncorrelated variables $Z_k = X\alpha_k$, $k = 1, \ldots, q < p$, that are linear combinations of the original variables and that explain most of the variation in the original set.
• Approximate the original set of $N$ points in $\mathbb{R}^p$ by a least-squares optimal linear manifold of dimension $q < p$.
• Approximate the $N \times p$ data matrix $X$ by the best rank-$q$ matrix $\hat{X}_{(q)}$. This is the usual motivation for the SVD.

SL&DM © Hastie & Tibshirani, January 25, 2010. Dimension Reduction
PC: Derived Variables

[Figure: scatterplot of $X_2$ against $X_1$ with the Largest Principal Component and Smallest Principal Component directions marked.]

$Z_1 = X\alpha_1$ is the projection of the data onto the longest direction, and has the largest variance amongst all such normalized projections.

$\alpha_1$ is the eigenvector corresponding to the largest eigenvalue of $\hat{\Sigma}$, the sample covariance matrix of $X$. $Z_2$ and $\alpha_2$ correspond to the second-largest eigenvalue.
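The eigenvector characterization above can be checked numerically. The following is a minimal sketch on synthetic two-dimensional data (the data, the random seed, and the variable names are illustrative assumptions, not taken from the slides). It computes the derived variables $Z_k = X\alpha_k$ from the eigenvectors of the sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic correlated 2-D data (an assumption for illustration only).
X = rng.normal(size=(200, 2)) @ np.array([[2.0, 0.0], [1.2, 0.5]])
Xc = X - X.mean(axis=0)                       # center each column

# Sample covariance matrix (np.cov uses the divisor N - 1 by default).
Sigma_hat = np.cov(Xc, rowvar=False)

# eigh returns eigenvalues in ascending order, so reorder to descending.
eigvals, eigvecs = np.linalg.eigh(Sigma_hat)
order = np.argsort(eigvals)[::-1]
eigvals, alphas = eigvals[order], eigvecs[:, order]

# Derived variables Z_k = X alpha_k, one column per component.
Z = Xc @ alphas
var_Z = Z.var(axis=0, ddof=1)                 # sample variance of each Z_k

# Variance of the projection onto an arbitrary unit direction, for comparison.
u = rng.normal(size=2)
u /= np.linalg.norm(u)
var_u = (Xc @ u).var(ddof=1)
```

Under these assumptions, `var_Z` should match the eigenvalues of $\hat{\Sigma}$, the columns of `Z` should be uncorrelated, and `var_u` should not exceed `var_Z[0]`.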
PC: Least Squares Approximation

Find the linear manifold $f(\lambda) = \mu + V_q \lambda$ that best approximates the data in a least-squares sense:

$$\min_{\mu,\, \{\lambda_i\},\, V_q} \sum_{i=1}^{N} \left\| x_i - \mu - V_q \lambda_i \right\|^2 .$$

Solution: $\mu = \bar{x}$, $v_k = \alpha_k$, $\lambda_i = V_q^T (x_i - \bar{x})$.
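As a rough illustration of the stated solution, the sketch below fits the manifold on synthetic data and compares its residual sum of squares with that of an arbitrary $q$-dimensional subspace through the same mean. The function names `fit_linear_manifold` and `reconstruction_error` are hypothetical helpers introduced here, and obtaining the $\alpha_k$ from an SVD of the centered data is one possible choice.

```python
import numpy as np

def fit_linear_manifold(X, q):
    """Return (mu, V_q, Lambda) for the q-dimensional least-squares fit."""
    mu = X.mean(axis=0)                               # mu = x-bar
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    V_q = Vt[:q].T                                    # p x q, columns v_k = alpha_k
    Lambda = (X - mu) @ V_q                           # row i is lambda_i = V_q^T (x_i - x-bar)
    return mu, V_q, Lambda

def reconstruction_error(X, mu, V):
    """Sum over i of ||x_i - mu - V V^T (x_i - mu)||^2 for orthonormal V."""
    Xc = X - mu
    residual = Xc - Xc @ V @ V.T
    return float(np.sum(residual ** 2))

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # synthetic data

mu, V_q, Lambda = fit_linear_manifold(X, q=2)
err_pc = reconstruction_error(X, mu, V_q)

# An arbitrary orthonormal basis of another 2-D subspace, for comparison.
Q, _ = np.linalg.qr(rng.normal(size=(5, 2)))
err_random = reconstruction_error(X, mu, Q)
```

With these inputs, `err_pc` should be no larger than `err_random`, which is what the least-squares optimality claims.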
PC: Singular Value Decomposition

Let $X$ be the $N \times p$ data matrix with centered columns (assume $N > p$).

$$X = U D V^T$$

is the SVD of $X$, where

• $U$ is $N \times p$ orthogonal ($U^T U = I_p$); its columns are the left singular vectors.
• $V$ is $p \times p$ orthogonal; its columns are the right singular vectors.
• $D$ is diagonal, with $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$, the singular values.

The SVD always exists, and is unique up to signs (when the singular values are distinct).

The columns of $V$ are the principal component directions, and $Z_j = U_j d_j$.

Let $D_q$ be $D$, with all but the first $q$ diagonal elements set to zero. Then $\hat{X}_q = U D_q V^T$ solves

$$\min_{\operatorname{rank}(\hat{X}_q) = q} \left\| X - \hat{X}_q \right\| .$$
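The construction above can be sketched with `numpy.linalg.svd`. The data here are synthetic, and reading the norm in the minimization as the Frobenius norm is an assumption of this example, since the slide does not name the norm.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p, q = 50, 6, 2

X = rng.normal(size=(N, p))          # synthetic data, for illustration only
X = X - X.mean(axis=0)               # center the columns

# Thin SVD: U is N x p, d holds the p singular values, Vt is V^T (p x p).
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Principal component scores: column j of Z is U_j d_j.
Z = U * d

# Rank-q approximation: zero all but the first q singular values.
d_q = d.copy()
d_q[q:] = 0.0
X_hat_q = (U * d_q) @ Vt

# Approximation error, taking the norm to be Frobenius (an assumption).
frob_err = np.linalg.norm(X - X_hat_q, "fro")
```

Here `Z` should agree with `X @ Vt.T`, the projection of the data onto the columns of $V$, and under the Frobenius reading `frob_err` should equal the square root of the sum of the discarded $d_j^2$.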
PC: Example Digit Data

130 threes, a subset of 638 such threes and part of the handwritten digit dataset. Each three is a $16 \times 16$ greyscale image, and the variables $X_j$, $j = 1, \ldots, 256$, are the greyscale values for each pixel.
Rank-2 Model for Threes

[Figure: plot with axes First Principal Component and Second Principal Component.]

The two-component model has the form

$$\hat{f}(\lambda) = \bar{x} + \lambda_1 v_1 + \lambda_2 v_2 .$$

[Figure: the same expression with $\bar{x}$, $v_1$, and $v_2$ shown as images.]

Here we have displayed the first two principal component directions, $v_1$ and $v_2$, as images.
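A minimal sketch of the two-component model follows. Random arrays stand in for the 130 digit images because the actual data are not reproduced here, and the helper name `f_hat` is hypothetical. The point is only to show how $\bar{x}$, $v_1$, and $v_2$ can be reshaped back into $16 \times 16$ images.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for 130 greyscale 16 x 16 images (random values, NOT real digits).
images = rng.random(size=(130, 16, 16))
X = images.reshape(130, 256)             # one row per image, one column per pixel

x_bar = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - x_bar, full_matrices=False)
v1, v2 = Vt[0], Vt[1]                    # first two principal component directions

def f_hat(lam1, lam2):
    """Two-component model x_bar + lam1 * v1 + lam2 * v2, returned as an image."""
    return (x_bar + lam1 * v1 + lam2 * v2).reshape(16, 16)

# Coordinates (lambda_1, lambda_2) of every image in the rank-2 model.
lambdas = (X - x_bar) @ Vt[:2].T         # shape (130, 2)

mean_image = f_hat(0.0, 0.0)             # lambda = 0 gives the mean image
recon_first = f_hat(*lambdas[0])         # rank-2 reconstruction of image 0
```

Each of `mean_image`, `v1.reshape(16, 16)`, and `v2.reshape(16, 16)` could then be displayed with any image plotting routine.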
SVD: Expression Arrays

The rows are genes (variables) and the columns are observations (samples, DNA arrays). Typically 6-10K genes, 50 samples.
Eigengenes

The first principal component or eigengene is the linear combination of the genes showing the most variation over the samples.

The individual gene loadings for each eigengene, or eigenarrays, can have biological meaning.

The sample values for the eigengenes show useful low-dimensional projections.
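With genes as rows, the roles of the singular vectors swap relative to the usual $N \times p$ layout. The sketch below uses a synthetic genes-by-samples matrix (the sizes `G` and `n` and all values are assumptions for illustration) and centers each gene across samples before taking the SVD.

```python
import numpy as np

rng = np.random.default_rng(4)
G, n = 500, 50                           # hypothetical sizes: genes x samples

A = rng.normal(size=(G, n))              # synthetic expression matrix
A = A - A.mean(axis=1, keepdims=True)    # center each gene (row) over samples

# A = U D V^T with U: G x n, d: n singular values, Vt: n x n.
U, d, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenarrays: gene loadings, one value per gene (columns of U).
eigenarrays = U[:, :2]                   # shape (G, 2)

# Eigengenes: sample values of each linear combination of genes,
# one value per sample (rows of D V^T).
eigengenes = (d[:, None] * Vt)[:2]       # shape (2, n)
```

Plotting `eigengenes[0]` against `eigengenes[1]` gives one point per sample, while plotting a column of `eigenarrays` against the gene index shows the loadings.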
Example: NCI Cancer Data

[Figure, panel "First two eigengenes": Principal Component 2 against Principal Component 1. Points are colored according to NCI cancer classes.]

[Figure, panels "First two eigenarrays": Loading for PC-1 and Loading for PC-2, each plotted against Gene.]