Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling

Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 / 31

Outline 2. Non-linear Dimensionality Reduction 3. Supervised Dimensionality Reduction 2 / 31

Outline 2. Non-linear Dimensionality Reduction 3. Supervised Dimensionality Reduction 1 / 31

The Dimensionality Reduction Problem Given a set X called data space, e.g., X := R m, a set X X called data, a function D : X X,K N (R K ) X R + 0 called distortion where D(P) measures how bad a low dimensional representation P : X R K for a data set X X is, and a number K N of latent dimensions, find a low dimensional representation P : X R K with K dimensions with minimal distortion D(P). 1 / 31

Distortions for Dimensionality Reduction (1/2) Let d X be a distance on X and d Z be a distance on the latent space R K, usually just the Euclidean distance K d Z (v, w) := v w 2 = ( (v i w i ) 2 ) 1 2 i=1 Multidimensional scaling aims to find latent representations P that reproduce the distance measure d X as good as possible: D(P) := = 2 X ( X 1) 2 n(n 1) x,x X x x (d X (x, x ) d Z (P(x), P(x ))) 2 n i 1 (d X (x i, x j ) z i z j ) 2, z i := P(x i ) i=1 j=1 2 / 31

Distortions for Dimensionality Reduction (2/2) Feature reconstruction methods aim to find latent representations P and reconstruction maps r : R K X from a given class of maps that reconstruct features as good as possible: D(P, r) := 1 X = 1 n d X (x, r(p(x))) x X n d X (x i, r(z i )), z i := P(x i ) i=1 3 / 31

PCA: Intuition We want to find an orthogonal basis which represent directions of maximum variance 4 / 31

PCA: Intuition we want to find an orthogonal basis which represent directions of maximum variance red line indicates direction of maximal variance within the data 5 / 31

PCA: Intuition we want to find an orthogonal basis which represent directions of maximum variance red line indicates direction of maximal variance within the data blue line represents direction of maximal variance given that it has to be orthogonal to the red line 6 / 31

PCA: Task We assume that the data has mean Zero: n x i = 0 i=1 Then PCA can be accomplished by finding the top K eigenvalues of the covariance matrix X X of the data! Mean has to be zero to account for different units in the data (Degrees C vs Degrees F) Can also be accomplished through a singular value decomposition of the data matrix X as we will see 7 / 31

Singular Value Decomposition (SVD) Theorem (Existence of SVD) For every A R n m there exist matrices U R n k,v R m k, Σ := diag(σ 1,..., σ k ) R k k, k := min{n, m} σ 1 σ 2 σ r > σ r+1 = = σ k = 0, r := rank(a) U, V orthonormal, i.e., U T U = I, V T V = I with A = UΣV T σ i are called singular values of A. Note: I := diag(1,..., 1) R k k denotes the unit matrix. 8 / 31

Singular Value Decomposition (SVD; 2/2) It holds: a) σi 2 are eigenvalues and V i eigenvectors of A T A: (A T A)V i = σi 2 V i, i = 1,..., k, V = (V 1,..., V k ) b) σi 2 are eigenvalues and U i eigenvectors of AA T : (AA T )U i = σi 2 U i, i = 1,..., k, U = (U 1,..., U k ) proof: a) (A T A)V i = V Σ T U T UΣV T V i = V Σ 2 e i = σi 2 V i b) (AA T )U i = UΣ T V T V Σ T U T U i = UΣ 2 e i = σi 2 U i 9 / 31

Truncated SVD Let A R n m and UΣV T = A its SVD. Then for k min{n, m} the decomposition with A = U Σ V T U := (U,1,..., U,k ), V := (V,1,..., V,k ), Σ := diag(σ 1,..., σ k ) is called truncated SVD with rank k. 10 / 31

Low Rank Approximation Let A R n m. For k min{n, m}, any pair of matrices U R n k, V R m k is called a low rank approximation of A with rank k. The matrix UV T is called the reconstruction of A by U, V and the quantity A UV T F the L2 reconstruction error. 11 / 31

Optimal Low Rank Approximation is Truncated SVD Theorem (Low Rank Approximation; Eckart-Young theorem) Let A R n m. For k min{n, m}, the optimal low rank approximation of rank k (i.e., with smallest reconstruction error) is the truncated SVD. (U, V ) := arg min U R n k,v R m k A UV T 2 Note: As U, V do not have to be orthonormal, one can take U := U Σ, V := V for the SVD A = U Σ V T. 12 / 31

Principal Components Analysis (PCA) Let X := {x 1,..., x n } R m be a data set and K N the number of latent dimensions (K m). PCA finds K principal components v 1,..., v K R m and latent weights z i R K for each data point i {1,..., n}, such that the linear combination of the principal components reconstructs the original features x i as good as possible: arg min v 1,...,v K z 1,...,zn = n x i i=1 K z i,k v k 2 k=1 n x i Vz i 2, i=1 V := (v 1,..., v K ) T 13 / 31

PCA Algorithm 1: procedure dimred-pca(d := {x 1,..., x N } R M, K N) 2: X := (x 1, x 2,..., x N ) T 3: (U, Σ, V ) := svd(x ) 4: Z := U.,1:K Σ 1:K,1:K 5: return D dimred := {Z 1,.,..., Z N,. } 14 / 31

FIGURE 14.15. Simulated data in three classes, near 15 / 31 Principal Components Analysis (Example 1) 1.5 1 0.5 0 0.5 1 1.5 1 0.5 0 0.5 1 1 0.5 0 1.5 1 0.5 [HTFF05, p. 530]

Principal Components Analysis (Example 1) Second principal component 1.0 0.5 0.0 0.5 1.0 1.0 0.5 0.0 0.5 1.0 First principal component FIGURE 14.21. The best rank-two linear approximation to the half-sphere data. The right panel shows the [HTFF05, p. 536] projected points with coordinates given by U 2 D 2,the first two principal components of the data. 15 / 31

Principal Components Analysis (Example 2) FIGURE 14.22. A sample of 130 handwritten [HTFF05, 3 s p. 537] shows a variety of writing styles. 16 / 31

Principal Components Analysis (Example 2) First Principal Component Second Principal Component -6-4 -2 0 2 4 6 8-5 0 5 O O O O O O O O O O O O O O O O O O O O O O O O O FIGURE 14.23. (Left panel:) the first two principal components of the handwritten threes. The circled points are the closest projected images to the vertices of 16 / 31 [HTFF05, p. 538]

2. Non-linear Dimensionality Reduction Outline 2. Non-linear Dimensionality Reduction 3. Supervised Dimensionality Reduction 17 / 31

2. Non-linear Dimensionality Reduction Linear Dimensionality Reduction Dimensionality reduction accomplishes two tasks: 1. compute lower dimensional representations for given data points x i for PCA: u i = Σ 1 V T x i, U := (u 1,..., u n ) T 2. compute lower dimensional representations for new data points x (often called fold in ) for PCA: u := arg min x V Σu 2 = Σ 1 V T x u 17 / 31

2. Non-linear Dimensionality Reduction Kernel Trick Represent (conceptionally) non-linearity by linearity in a higher dimensional embedding φ : R m R m but compute in lower dimensionality for methods that depend on x only through a scalar product x T θ = φ(x) T φ(θ) = k(x, θ), x, θ R m if k can be computed without explicitly computing φ. 18 / 31

2. Non-linear Dimensionality Reduction Kernel Trick / Example Example: φ : R R 1001, ( ( 1000 x i ) 1 ) 2 x i i=0,...,1000 1000 ( x T θ = φ(x) T 1000 φ(θ) = i Naive computation: i=0 = 1 31.62 x 706.75 x 2. 31.62 x 999 x 1000 ) x i θ i = (1 + xθ) 1000 =: k(x, θ) 2002 binomial coefficients, 3003 multiplications, 1000 additions. Kernel computation: 1 multiplication, 1 addition, 1 exponentiation. 19 / 31

2. Non-linear Dimensionality Reduction Kernel PCA φ :R m R m, φ(x 1 ) φ(x 2 ) X :=. φ(x n ) X UΣṼ T m m We can compute the columns of U as eigenvectors of X X T R n n without having to compute Ṽ R m k (which is large!): X X T U i = σ 2 i U i 20 / 31

2. Non-linear Dimensionality Reduction Kernel PCA / Removing the Mean Issue 1: The x i := φ(x i ) may not have zero mean and thus distort PCA. x i := x i 1 n n x i i=1 21 / 31

2. Non-linear Dimensionality Reduction Kernel PCA / Removing the Mean Issue 1: The x i := φ(x i ) may not have zero mean and thus distort PCA. x i := x i 1 n n x i i=1 = X T (I 1 n 1I) X :=( x 1,..., x n) T = (I 1 n 1I) X T Note: 1I := (1) i=1,...,n,j=1,...,n vector of ones, I := (δ(i = j)) Lars Schmidt-Thieme, i=1,...,n,j=1,...,n unity matrix. Nicolas Schilling, Information Systems and Machine Learning Lab (ISMLL), University of Hildesheim, Germany 21 / 31

2. Non-linear Dimensionality Reduction Kernel PCA / Removing the Mean Issue 1: The x i := φ(x i ) may not have zero mean and thus distort PCA. x i := x i 1 n n x i i=1 = X T (I 1 n 1I) X :=( x 1,..., x n) T = (I 1 n 1I) X T K := X X T = (I 1 n 1I) X T X (I 1 n 1I) =HKH, H := (I 1 1I) centering matrix n Thus, the kernel matrix K with means removed can be computed from the kernel matrix K without having to access coordinates. 21 / 31

2. Non-linear Dimensionality Reduction Kernel PCA / Fold In Issue 2: How to compute projections u of new points x (as Ṽ is not computed)? With Ṽ = X T UΣ 1 u := arg min x Ṽ Σu 2 = Σ 1 Ṽ T x u u = Σ 1 Ṽ T x = Σ 1 Σ 1 U T X x = Σ 2 U T (k(x i, x)) i=1,...,n u can be computed with access to kernel values only (and to U, Σ). 22 / 31

2. Non-linear Dimensionality Reduction Kernel PCA / Summary Given: data set X := {x 1,..., x n } R m, kernel function k : R m R m R. task 1: Learn latent representations U of data set X : K :=(k(x i, x j )) i=1,...,n,j=1,...,n (0) K :=HKH, H := (I 1 1I) (1) n (U, Σ) :=eigen decomposition (K ) (2) task 2: Learn latent representation u of new point x: u := Σ 2 U T (k(x i, x)) i=1,...,n (3) 23 / 31

2. Non-linear Dimensionality Reduction Kernel PCA: Example 1 Eigenvalue=22.558 1.5 Eigenvalue=20.936 1.5 Eigenvalue=4.648 1.5 Eigenvalue=3.988 1.5 1 1 1 1 0.5 0.5 0.5 0.5 0 0 0 0 0.5 1 0 1 0.5 0.5 0.5 1 0 1 1 0 1 1 0 1 Eigenvalue=3.372 1.5 Eigenvalue=2.956 1.5 Eigenvalue=2.760 1.5 Eigenvalue=2.211 1.5 1 1 1 1 0.5 0.5 0.5 0.5 0 0 0 0 0.5 1 0 1 0.5 0.5 0.5 1 0 1 1 0 1 1 0 1 [Mur12, p. 493] 24 / 31

2. Non-linear Dimensionality Reduction Kernel PCA: Example 2 0.6 pca 0.4 0.2 0 0.2 0.4 0.6 0.8 0.6 0.4 0.2 0 0.2 0.4 0.6 0.8 [Mur12, p. 495] 25 / 31

2. Non-linear Dimensionality Reduction Kernel PCA: Example 2 0.6 kpca 0.4 0.2 0 0.2 0.4 0.6 0.8 0.8 0.6 0.4 0.2 0 0.2 0.4 0.6 0.8 [Mur12, p. 495] 25 / 31

3. Supervised Dimensionality Reduction Outline 2. Non-linear Dimensionality Reduction 3. Supervised Dimensionality Reduction 26 / 31

3. Supervised Dimensionality Reduction Dimensionality Reduction as Pre-Processing Given a prediction task and a data set D train := {(x 1, y 1 ),..., (x n, y n )} R m Y. 1. compute latent features z i R K for the objects of a data set by means of dimensionality reduction of the predictors x i. e.g., using PCA on {x1,..., x n } R m 2. learn a prediction model on the latent features based on ŷ : R K Y D train := {(z 1, y 1 ),..., (z n, y n )} 3. treat the number K of latent dimensions as hyperparameter. e.g., find using grid search. 26 / 31

3. Supervised Dimensionality Reduction Dimensionality Reduction as Pre-Processing Advantages: simple procedure generic procedure works with any dimensionality reduction method and prediction method as component methods. usually fast Disadvantages: dimensionality reduction is unsupervised, i.e., not informed about the target that should be predicted later on. leads to the very same latent features regardless of the prediction task. likely not the best task-specific features are extracted. 27 / 31

3. Supervised Dimensionality Reduction Supervised PCA p(z) := N (z; 0, 1) p(x z; µ x, σ 2 x, W x ) := N (x; µ x + W x z, σ 2 xi ) p(y z; µ y, σ 2 y, W y ) := N (y; µ y + W y z, σ 2 y I ) like two PCAs, coupled by shared latent features z: one for the predictors x. one for the targets y. latent features act as information bottleneck. also known as Latent Factor Regression or Bayesian Factor Regression. 28 / 31

3. Supervised Dimensionality Reduction Supervised PCA: Discriminative Likelihood A simple likelihood would put the same weight on reconstructing the predictors and reconstructing the targets. A weight α R + 0 for the reconstruction error of the predictors should be introduced (discriminative likelihood): L α (Θ; x, y, z) := n p(y i z i ; Θ)p(x i z i ; Θ) α p(z i ; Θ) i=1 α can be treated as hyperparameter and found by grid search. 29 / 31

3. Supervised Dimensionality Reduction Supervised PCA: EM The M-steps for µ x, σ 2 x, W x and µ y, σ 2 y, W y are exactly as before. the coupled E-step is: ( 1 z i = σy 2 Wy T W y + α 1 ) 1 σx 2 Wx T W x ( 1 σy 2 Wy T (y i µ y ) + α 1 ) σx 2 Wx T (x i µ x ) 30 / 31

3. Supervised Dimensionality Reduction Conclusion (1/3) Dimensionality reduction aims to find a lower dimensional representation of data that preserves the information as much as possible. Preserving information means to preserve pairwise distances between objects (multidimensional scaling). to be able to reconstruct the original object features (feature reconstruction). The truncated Singular Value Decomposition (SVD) provides the best low rank factorization of a matrix in two factor matrices. SVD is usually computed by an algebraic factorization method (such as QR decomposition). 31 / 31

3. Supervised Dimensionality Reduction Conclusion (2/3) Principal components analysis (PCA) finds latent object and variable features that provide the best linear reconstruction (in L2 error). PCA is a truncated SVD of the data matrix. Probabilistic PCA (PPCA) provides a probabilistic interpretation of PCA. PPCA adds a L2 regularization of the object features. PPCA is learned by the EM algorithm. Adding L2 regularization for the linear reconstruction/variable features on top leads to Bayesian PCA. Generalizing to variable-specific variances leads to Factor Analysis. For both, Bayesian PCA and Factor Analysis, EM can be adapted easily. 31 / 31

3. Supervised Dimensionality Reduction Conclusion (3/3) To capture a nonlinear relationship between latent features and observed features, PCA can be kernelized (Kernel PCA). Learning a Kernel PCA is done by an eigen decomposition of the kernel matrix. Kernel PCA often is found to lead to unnatural visualizations. But Kernel PCA sometimes provides better classification performance for simple classifiers on latent features (such as 1-Nearest Neighbor). 31 / 31

Readings Principal Components Analysis (PCA) [HTFF05], ch. 14.5.1, [Bis06], ch. 12.1, [Mur12], ch. 12.2. Probabilistic PCA [Bis06], ch. 12.2, [Mur12], ch. 12.2.4. Factor Analysis [HTFF05], ch. 14.7.1, [Bis06], ch. 12.2.4. Kernel PCA [HTFF05], ch. 14.5.4, [Bis06], ch. 12.3, [Mur12], ch. 14.4.4. 32 / 31

Further Readings (Non-negative) Matrix Factorization [HTFF05], ch. 14.6 Independent Component Analysis, Exploratory Projection Pursuit [HTFF05], ch. 14.7 [Bis06], ch. 12.4 [Mur12], ch. 12.6. Nonlinear Dimensionality Reduction [HTFF05], ch. 14.9, [Bis06], ch. 12.4 33 / 31

References Christopher M. Bishop. Pattern recognition and machine learning, volume 1. springer New York, 2006. Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The elements of statistical learning: data mining, inference and prediction, volume 27. 2005. Kevin P. Murphy. Machine learning: a probabilistic perspective. The MIT Press, 2012. 34 / 31