Lecture 3: Principal Components Analysis (PCA)
Reading: Sections 6.3.1, 10.1, 10.2, 10.4
STATS 202: Data mining and analysis
Jonathan Taylor, 9/28
Slide credits: Sergio Bacallado
The bias-variance decomposition

The inputs x_1, ..., x_n are fixed, and a test point x_0 is also fixed. We assume y_i = f(x_i) + ε_i, with the ε_i i.i.d. with mean 0. A regression method fit to (x_1, y_1), ..., (x_n, y_n) produces the estimate f̂. Then the Mean Squared Error at x_0 satisfies:

MSE(x_0) = E(y_0 − f̂(x_0))² = Var(f̂(x_0)) + [Bias(f̂(x_0))]² + Var(ε).

Both the variance and the squared bias are always positive, so to minimize the MSE you must reach a tradeoff between bias and variance.
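The decomposition can be checked numerically. The sketch below is illustrative, not from the slides: it assumes a hypothetical true function f(x) = sin(2πx), Gaussian noise, and a degree-3 polynomial fit as the regression method, refits over many training sets with the inputs held fixed, and compares the Monte Carlo MSE at x_0 against Var + Bias² + Var(ε):

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed design: the inputs and the test point stay the same across repetitions.
x = np.linspace(0, 1, 30)
x0 = 0.3

def f(t):
    return np.sin(2 * np.pi * t)  # assumed true regression function

sigma = 0.3  # noise sd, so Var(eps) = sigma**2

degree, reps = 3, 20000
preds = np.empty(reps)  # fhat(x0) over repeated training sets
errs = np.empty(reps)   # (y0 - fhat(x0))**2 over repetitions

for r in range(reps):
    y = f(x) + sigma * rng.standard_normal(x.size)  # fresh training responses
    coef = np.polyfit(x, y, degree)                 # least-squares polynomial fit
    preds[r] = np.polyval(coef, x0)
    y0 = f(x0) + sigma * rng.standard_normal()      # fresh test response
    errs[r] = (y0 - preds[r]) ** 2

mse = errs.mean()
var = preds.var()                    # Var(fhat(x0))
bias2 = (preds.mean() - f(x0)) ** 2  # squared Bias(fhat(x0))
# mse should match var + bias2 + sigma**2 up to Monte Carlo error
```

Increasing the polynomial degree shifts weight from the bias² term to the variance term, which is exactly the tradeoff pictured in Figure 2.12.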
[Figure 2.12: MSE, squared bias, and variance as a function of flexibility, in three scenarios: squiggly f with high noise, linear f with high noise, and squiggly f with low noise.]
Classification problems

In a classification setting, the output takes values in a discrete set. For example, if we are predicting the brand of a car based on a number of variables, the function f takes values in the set {Ford, Toyota, Mercedes-Benz, ...}.

The model Y = f(X) + ε becomes insufficient, as Y is not necessarily real-valued.

We will use slightly different notation:
P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷ_i: prediction for x_i.
Loss function for classification

There are many ways to measure the error of a classification prediction. One of the most common is the 0-1 loss:

E(1(y_0 ≠ ŷ_0))

Like the MSE, this quantity can be estimated from training and test data by taking a sample average:

(1/n) Σ_{i=1}^n 1(y_i ≠ ŷ_i)
Bayes classifier

[Figure 2.13: simulated classification data in two dimensions, X1 and X2.]

In practice, we never know the joint probability P. However, we can assume that it exists. The Bayes classifier assigns:

ŷ_i = argmax_j P(Y = j | X = x_i)

It can be shown that this is the best classifier under the 0-1 loss.
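A minimal simulation makes the optimality concrete. It assumes a toy joint distribution, not one from the slides: X ~ Uniform(0, 1) and P(Y = 1 | X = x) = x, so the Bayes rule predicts 1 exactly when x > 1/2, and its 0-1 loss beats that of a simpler competitor:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy joint distribution: X ~ Uniform(0, 1), P(Y = 1 | X = x) = x.
n = 200_000
x = rng.uniform(0, 1, n)
y = (rng.uniform(0, 1, n) < x).astype(int)

# Bayes classifier: pick the class with the largest conditional probability,
# i.e. predict 1 exactly when P(Y = 1 | X = x) = x exceeds 1/2.
yhat_bayes = (x > 0.5).astype(int)

# A plausible but suboptimal competitor: always predict the majority class.
yhat_majority = np.ones(n, dtype=int)

err_bayes = np.mean(y != yhat_bayes)        # approx E[min(x, 1 - x)] = 1/4
err_majority = np.mean(y != yhat_majority)  # approx E[1 - x] = 1/2
```

The Bayes error E[min(P(Y=1|X), P(Y=0|X))] is the floor that no classifier can beat; here it works out to 1/4.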
Principal Components Analysis

This is the most popular unsupervised procedure ever. Invented by Karl Pearson (1901). Developed by Harold Hotelling (1933). Stanford pride!

What does it do? It provides a way to visualize high-dimensional data, summarizing the most important information.
What is PCA good for?

[Scatterplot matrix of the USArrests data: Murder, Assault, UrbanPop, and Rape plotted pairwise.]

[Figure 10.1: biplot of the USArrests data. States are plotted by their scores on the first two principal components, with the loading vectors for Murder, Assault, UrbanPop, and Rape overlaid.]
What is the first principal component?

It is the vector which passes the closest to a cloud of samples, in terms of squared Euclidean distance.

[Scatterplot of Ad Spending against Population, with the first principal component direction drawn through the cloud of points.]
i.e., the green direction minimizes the average squared length of the dotted lines.

[Figure 6.15: the same data plotted in the original (Population, Ad Spending) coordinates and in the (1st Principal Component, 2nd Principal Component) coordinates.]
What does this look like with 3 variables?

The first two principal components span a plane which is closest to the data.

[Figure 10.2: three-dimensional data together with the plane spanned by the first two principal components.]
A second interpretation

The projection onto the first principal component is the one with the highest variance.

[Figure 6.15 again: the data in original coordinates and in principal-component coordinates.]
How do we say this in math?

Let X be a data matrix with n samples and p variables. From each variable, we subtract the mean of the column; i.e., we center the variables.

To find the first principal component φ_1 = (φ_11, ..., φ_p1), we solve the following optimization:

maximize over φ_11, ..., φ_p1:  (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φ_j1 x_ij )²
subject to:  Σ_{j=1}^p φ_j1² = 1.

The inner sum Σ_{j=1}^p φ_j1 x_ij is the projection of the ith sample onto φ_1, also known as the score z_i1. The objective is therefore the variance of the n samples projected onto φ_1.
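As a sanity check on the variance-maximization view, the sketch below (using simulated data, since no dataset is attached to these slides) computes φ_1 via the SVD and confirms that no random unit vector attains a larger projected variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated centered data matrix: n samples, p variables.
n, p = 500, 4
X = rng.standard_normal((n, p)) @ rng.standard_normal((p, p))
X = X - X.mean(axis=0)  # center each column

# The known solution: the first right singular vector of X.
phi1 = np.linalg.svd(X, full_matrices=False)[2][0]

def projected_variance(phi):
    """(1/n) * sum_i (sum_j phi_j * x_ij)**2, the objective being maximized."""
    return np.mean((X @ phi) ** 2)

best = projected_variance(phi1)

# No random unit vector should achieve a larger projected variance.
others = []
for _ in range(1000):
    v = rng.standard_normal(p)
    others.append(projected_variance(v / np.linalg.norm(v)))
```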
How do we say this in math?

To find the second principal component φ_2 = (φ_12, ..., φ_p2), we solve the following optimization:

maximize over φ_12, ..., φ_p2:  (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φ_j2 x_ij )²
subject to:  Σ_{j=1}^p φ_j2² = 1  and  Σ_{j=1}^p φ_j1 φ_j2 = 0.

The second constraint says the first and second principal components must be orthogonal. This is equivalent to saying that the scores (z_11, ..., z_n1) and (z_12, ..., z_n2) are uncorrelated.
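Both statements, orthogonal loadings and uncorrelated scores, can be verified numerically; a short sketch on simulated data (the data matrix is an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 4)) @ rng.standard_normal((4, 4))
X = X - X.mean(axis=0)  # center the columns

Vt = np.linalg.svd(X, full_matrices=False)[2]
phi1, phi2 = Vt[0], Vt[1]    # first two principal components
z1, z2 = X @ phi1, X @ phi2  # the corresponding score vectors

dot = phi1 @ phi2                 # orthogonality of the loadings
corr = np.corrcoef(z1, z2)[0, 1]  # correlation of the scores
```

Both quantities are zero up to floating-point error.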
Solving the optimization

This optimization is fundamental in linear algebra. It is solved by either:

The singular value decomposition (SVD) of X:

X = UΣΦᵀ

where the ith column of Φ is the ith principal component φ_i, and the ith column of UΣ is the ith vector of scores (z_1i, ..., z_ni).

The eigendecomposition of XᵀX:

XᵀX = ΦΣ²Φᵀ
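The two routes can be compared directly in code; a sketch on simulated data (note that np.linalg.eigh returns eigenvalues in ascending order, so they are re-sorted, and each component's sign must be aligned because principal components are only defined up to sign):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 3))
X = X - X.mean(axis=0)  # centered data matrix

# Route 1: SVD, X = U Sigma Phi^T.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
scores_svd = U * S  # columns are the score vectors

# Route 2: eigendecomposition of X^T X = Phi Sigma^2 Phi^T.
evals, evecs = np.linalg.eigh(X.T @ X)
order = np.argsort(evals)[::-1]  # eigh returns ascending order
evals, evecs = evals[order], evecs[:, order]

# Squared singular values equal the eigenvalues of X^T X.
gap = np.max(np.abs(S ** 2 - evals))

# Scores from X @ Phi agree with U Sigma, up to the sign of each component.
scores_eig = X @ evecs
signs = np.sign(np.sum(scores_eig * scores_svd, axis=0))
diff = np.max(np.abs(scores_eig * signs - scores_svd))
```

In practice the SVD of X is usually preferred: it avoids forming XᵀX, which squares the condition number.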
PCA in practice: The biplot

[Figure 10.1: biplot of the USArrests data, showing the state scores and the loading vectors for Murder, Assault, UrbanPop, and Rape on the first two principal components.]
Scaling the variables

Most of the time, we don't care about the absolute numerical value of a variable; we care about its value relative to the spread observed in the sample. Before PCA, in addition to centering each variable, we also multiply it by a constant to make its variance equal to 1.
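A short sketch of this preprocessing step, with assumed synthetic data in which one variable has a much larger scale than the other, shows why it matters: without scaling, the first principal component simply points along the large-scale variable.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic data: two independent variables on very different scales.
X = np.column_stack([100 * rng.standard_normal(300),  # large-scale variable
                     rng.standard_normal(300)])       # small-scale variable

# Center, then scale each column to unit variance before PCA.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)

# Without scaling, the first PC is dominated by the large-scale variable.
phi1_raw = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)[2][0]
# After scaling, both variables get a fair say in the loadings.
phi1_scaled = np.linalg.svd(Xs, full_matrices=False)[2][0]
```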
Example: scaled vs. unscaled PCA

[Figure 10.3: biplots of the USArrests data computed from scaled variables (left) and unscaled variables (right).]
Scaling the variables

In special cases, we have variables measured in the same unit; e.g., gene expression levels for different genes. In those cases, we care about the absolute value of the variables, and we can perform PCA without scaling.
How many principal components are enough?

[Scatterplot matrix of the USArrests data, and the biplot of its first two principal components (Figure 10.1).]

We said 2 principal components capture most of the relevant information. But how can we tell?
The proportion of variance explained

We can think of the top principal components as directions in space in which the data vary the most.

The ith score vector (z_1i, ..., z_ni) can be interpreted as a new variable. The variance of this variable decreases as we take i from 1 to p. However, the total variance of the score vectors is the same as the total variance of the original variables:

Σ_{i=1}^p (1/n) Σ_{j=1}^n z_ji² = Σ_{k=1}^p Var(x_k).

We can quantify how much of the variance is captured by the first m principal components/score variables.
The proportion of variance explained

The variance of the mth score variable is:

(1/n) Σ_{i=1}^n z_im² = (1/n) Σ_{i=1}^n ( Σ_{j=1}^p φ_jm x_ij )² = (1/n) Σ_mm²,

where Σ_mm is the mth diagonal entry of Σ in the SVD X = UΣΦᵀ, i.e. the mth singular value.

[Scree plot: the proportion of variance explained by each principal component of the USArrests data, and the cumulative proportion of variance explained.]
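The proportion of variance explained by the mth component is then Σ_mm² / Σ_k Σ_kk². A sketch on simulated data (an assumption for illustration) verifies the total-variance identity and computes the PVE and cumulative PVE that a scree plot would display:

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((400, 5)) @ rng.standard_normal((5, 5))
X = X - X.mean(axis=0)  # centered data matrix
n = X.shape[0]

S = np.linalg.svd(X, compute_uv=False)  # singular values Sigma_mm
score_var = S ** 2 / n                  # variance of each score variable
total_var = X.var(axis=0).sum()         # total variance of the original variables

pve = score_var / score_var.sum()  # proportion of variance explained
cum_pve = np.cumsum(pve)           # cumulative PVE, as in the scree plot

gap = abs(score_var.sum() - total_var)  # the total-variance identity
```

A common heuristic is to keep components up to the "elbow" of the scree plot, or enough to reach a target cumulative PVE.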
Generalizations of PCA

PCA works under a Euclidean geometry in the space of variables. Often, the natural geometry is different:

We expect some variables to be closer to each other than to other variables.
Some correlations between variables would be more surprising than others.

Examples:

Variables are pixel values, samples are different images of the brain. We expect neighboring pixels to have stronger correlations.
Variables are rainfall measurements at different regions. We expect neighboring regions to have higher correlations.
Generalizations of PCA

There are ways to include this knowledge in a PCA. See:

1. Susan Holmes. Multivariate Analysis, the French way. (2006).
2. Omar de la Cruz and Susan Holmes. An introduction to the duality diagram. (2011).
3. Stéphane Dray and Thibaut Jombart. Revisiting Guerry's data: Introducing spatial constraints in multivariate analysis. (2011).
4. Genevera Allen, Logan Grosenick, and Jonathan Taylor. A Generalized Least Squares Matrix Decomposition. (2011).