Machine Learning. B. Unsupervised Learning B.2 Dimensionality Reduction. Lars Schmidt-Thieme, Nicolas Schilling


Machine Learning B. Unsupervised Learning B.2 Dimensionality Reduction Lars Schmidt-Thieme, Nicolas Schilling Information Systems and Machine Learning Lab (ISMLL) Institute for Computer Science University of Hildesheim, Germany 1 / 31

Outline

2. Non-linear Dimensionality Reduction
3. Supervised Dimensionality Reduction

2 / 31

The Dimensionality Reduction Problem

Given
- a set $\mathcal{X}$ called data space, e.g., $\mathcal{X} := \mathbb{R}^m$,
- a set $X \subseteq \mathcal{X}$ called data,
- a function $D : \bigcup_{X \subseteq \mathcal{X},\, K \in \mathbb{N}} (\mathbb{R}^K)^X \to \mathbb{R}^+_0$ called distortion, where $D(P)$ measures how bad a low dimensional representation $P : X \to \mathbb{R}^K$ for a data set $X \subseteq \mathcal{X}$ is, and
- a number $K \in \mathbb{N}$ of latent dimensions,

find a low dimensional representation $P : X \to \mathbb{R}^K$ with $K$ dimensions with minimal distortion $D(P)$.

1 / 31

Distortions for Dimensionality Reduction (1/2)

Let $d_{\mathcal{X}}$ be a distance on $\mathcal{X}$ and $d_Z$ be a distance on the latent space $\mathbb{R}^K$, usually just the Euclidean distance
$$ d_Z(v, w) := \|v - w\|_2 = \Big( \sum_{i=1}^K (v_i - w_i)^2 \Big)^{\frac{1}{2}} $$

Multidimensional scaling aims to find latent representations $P$ that reproduce the distance measure $d_{\mathcal{X}}$ as well as possible:
$$ D(P) := \frac{2}{|X|\,(|X| - 1)} \sum_{\substack{x, x' \in X \\ x \neq x'}} \big( d_{\mathcal{X}}(x, x') - d_Z(P(x), P(x')) \big)^2 = \frac{2}{n (n - 1)} \sum_{i=1}^{n} \sum_{j=1}^{i-1} \big( d_{\mathcal{X}}(x_i, x_j) - \|z_i - z_j\| \big)^2, \quad z_i := P(x_i) $$

2 / 31

Distortions for Dimensionality Reduction (2/2)

Feature reconstruction methods aim to find latent representations $P$ and reconstruction maps $r : \mathbb{R}^K \to \mathcal{X}$ from a given class of maps that reconstruct the features as well as possible:
$$ D(P, r) := \frac{1}{|X|} \sum_{x \in X} d_{\mathcal{X}}(x, r(P(x))) = \frac{1}{n} \sum_{i=1}^{n} d_{\mathcal{X}}(x_i, r(z_i)), \quad z_i := P(x_i) $$

3 / 31
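Both distortion measures are easy to compute in practice. Below is a minimal numpy sketch, assuming Euclidean distances for $d_{\mathcal{X}}$ and $d_Z$ and, purely for illustration, a random linear map $P(x) = Wx$ with a random linear reconstruction $r(z) = Rz$; the data, $W$, and $R$ are hypothetical placeholders, not something the slides prescribe.

```python
# Minimal sketch of the two distortion measures, assuming Euclidean distances
# d_X = d_Z and an illustrative linear projection P(x) = W x with a linear
# reconstruction r(z) = R z (W and R are placeholders, not from the slides).
import numpy as np

rng = np.random.default_rng(0)
n, m, K = 50, 5, 2
X = rng.normal(size=(n, m))          # data set, one point per row
W = rng.normal(size=(K, m))          # latent map P(x) = W x
R = rng.normal(size=(m, K))          # reconstruction map r(z) = R z
Z = X @ W.T                          # latent representations z_i

def mds_distortion(X, Z):
    """Average squared mismatch between pairwise distances (MDS distortion)."""
    n = len(X)
    total = 0.0
    for i in range(n):
        for j in range(i):
            dX = np.linalg.norm(X[i] - X[j])
            dZ = np.linalg.norm(Z[i] - Z[j])
            total += (dX - dZ) ** 2
    return 2.0 / (n * (n - 1)) * total

def reconstruction_distortion(X, Z, R):
    """Average distance between each point and its reconstruction r(z_i)."""
    return np.mean(np.linalg.norm(X - Z @ R.T, axis=1))

print(mds_distortion(X, Z))
print(reconstruction_distortion(X, Z, R))
```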

PCA: Intuition

- We want to find an orthogonal basis which represents directions of maximum variance.
- The red line indicates the direction of maximal variance within the data.
- The blue line represents the direction of maximal variance given that it has to be orthogonal to the red line.

6 / 31

PCA: Task

We assume that the data has mean zero:
$$ \sum_{i=1}^{n} x_i = 0 $$

- Then PCA can be accomplished by finding the top $K$ eigenvalues of the covariance matrix $X^T X$ of the data.
- The mean has to be zero to account for different units in the data (degrees C vs. degrees F).
- PCA can also be accomplished through a singular value decomposition of the data matrix $X$, as we will see.

7 / 31

Singular Value Decomposition (SVD)

Theorem (Existence of SVD). For every $A \in \mathbb{R}^{n \times m}$ there exist matrices
$$ U \in \mathbb{R}^{n \times k}, \quad V \in \mathbb{R}^{m \times k}, \quad \Sigma := \operatorname{diag}(\sigma_1, \ldots, \sigma_k) \in \mathbb{R}^{k \times k}, \quad k := \min\{n, m\} $$
with
$$ \sigma_1 \geq \sigma_2 \geq \cdots \geq \sigma_r > \sigma_{r+1} = \cdots = \sigma_k = 0, \quad r := \operatorname{rank}(A), $$
$U, V$ orthonormal, i.e., $U^T U = I$, $V^T V = I$, such that
$$ A = U \Sigma V^T $$
The $\sigma_i$ are called singular values of $A$.

Note: $I := \operatorname{diag}(1, \ldots, 1) \in \mathbb{R}^{k \times k}$ denotes the unit matrix.

8 / 31
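As a quick sanity check, the theorem can be verified numerically; the sketch below uses numpy's economy SVD so that $U$ and $V$ have $k = \min\{n, m\}$ columns as in the theorem (the matrix $A$ is just random test data, an assumption of the sketch).

```python
# Numerical check of the SVD existence theorem with numpy's "economy" SVD,
# so that U and V have k = min(n, m) columns as in the theorem.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))                            # n = 6, m = 4, so k = 4

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
V = Vt.T

print(np.allclose(A, U @ np.diag(sigma) @ V.T))        # A = U Sigma V^T
print(np.allclose(U.T @ U, np.eye(4)))                 # U orthonormal
print(np.allclose(V.T @ V, np.eye(4)))                 # V orthonormal
print(np.all(np.diff(sigma) <= 1e-12))                 # singular values non-increasing
```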

Singular Value Decomposition (SVD; 2/2)

It holds:
a) the $\sigma_i^2$ are eigenvalues and the $V_i$ eigenvectors of $A^T A$:
$$ (A^T A) V_i = \sigma_i^2 V_i, \quad i = 1, \ldots, k, \quad V = (V_1, \ldots, V_k) $$
b) the $\sigma_i^2$ are eigenvalues and the $U_i$ eigenvectors of $A A^T$:
$$ (A A^T) U_i = \sigma_i^2 U_i, \quad i = 1, \ldots, k, \quad U = (U_1, \ldots, U_k) $$

Proof:
a) $(A^T A) V_i = V \Sigma^T U^T U \Sigma V^T V_i = V \Sigma^2 e_i = \sigma_i^2 V_i$
b) $(A A^T) U_i = U \Sigma V^T V \Sigma^T U^T U_i = U \Sigma^2 e_i = \sigma_i^2 U_i$

9 / 31

Truncated SVD

Let $A \in \mathbb{R}^{n \times m}$ and $U \Sigma V^T = A$ its SVD. Then for $k \leq \min\{n, m\}$ the decomposition
$$ A \approx U' \Sigma' V'^T $$
with
$$ U' := (U_{\cdot,1}, \ldots, U_{\cdot,k}), \quad V' := (V_{\cdot,1}, \ldots, V_{\cdot,k}), \quad \Sigma' := \operatorname{diag}(\sigma_1, \ldots, \sigma_k) $$
is called the truncated SVD with rank $k$.

10 / 31

Low Rank Approximation

Let $A \in \mathbb{R}^{n \times m}$. For $k \leq \min\{n, m\}$, any pair of matrices $U \in \mathbb{R}^{n \times k}$, $V \in \mathbb{R}^{m \times k}$ is called a low rank approximation of $A$ with rank $k$.

The matrix $U V^T$ is called the reconstruction of $A$ by $U, V$, and the quantity
$$ \|A - U V^T\|_F = \Big( \sum_{i=1}^{n} \sum_{j=1}^{m} (A_{i,j} - U_i^T V_j)^2 \Big)^{\frac{1}{2}} $$
the L2 reconstruction error (here $U_i$ and $V_j$ denote the $i$-th row of $U$ and the $j$-th row of $V$).

Note: $\|A\|_F$ is called the Frobenius norm.

11 / 31

Optimal Low Rank Approximation is the Truncated SVD

Theorem (Low Rank Approximation; Eckart-Young theorem). Let $A \in \mathbb{R}^{n \times m}$. For $k \leq \min\{n, m\}$, the optimal low rank approximation of rank $k$ (i.e., the one with smallest reconstruction error) is given by the truncated SVD:
$$ (U, V) := \operatorname*{arg\,min}_{U \in \mathbb{R}^{n \times k},\, V \in \mathbb{R}^{m \times k}} \|A - U V^T\|^2_F $$

Note: As $U, V$ do not have to be orthonormal, one can take $U := U' \Sigma'$, $V := V'$ for the truncated SVD $A \approx U' \Sigma' V'^T$.

12 / 31
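The theorem is easy to illustrate numerically. The sketch below (the test matrix and the random competitor factorization are illustrative assumptions) builds the rank-$k$ truncated SVD, absorbs $\Sigma'$ into $U'$ as suggested in the note, and compares its Frobenius reconstruction error with that of an arbitrary rank-$k$ factorization.

```python
# Illustration of the Eckart-Young theorem: the rank-k truncated SVD has a
# smaller Frobenius reconstruction error than an arbitrary (here: random)
# rank-k factorization of the same matrix.
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 20, 12, 3
A = rng.normal(size=(n, m))

U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
U_k = U[:, :k] * sigma[:k]          # absorb Sigma' into U' as in the note
V_k = Vt[:k, :].T

err_svd = np.linalg.norm(A - U_k @ V_k.T, "fro")
err_rand = np.linalg.norm(A - rng.normal(size=(n, k)) @ rng.normal(size=(m, k)).T, "fro")
print(err_svd, err_rand)            # err_svd is the smaller of the two
```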

Principal Components Analysis (PCA)

Let $X := \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^m$ be a data set and $K \in \mathbb{N}$ the number of latent dimensions ($K \ll m$). PCA finds $K$ principal components $v_1, \ldots, v_K \in \mathbb{R}^m$ and latent weights $z_i \in \mathbb{R}^K$ for each data point $i \in \{1, \ldots, n\}$, such that the linear combination of the principal components
$$ x_i \approx \sum_{k=1}^{K} z_{i,k} v_k $$
reconstructs the original features $x_i$ as well as possible:
$$ \operatorname*{arg\,min}_{\substack{v_1, \ldots, v_K \\ z_1, \ldots, z_n}} \sum_{i=1}^{n} \Big\| x_i - \sum_{k=1}^{K} z_{i,k} v_k \Big\|^2 = \sum_{i=1}^{n} \|x_i - V z_i\|^2 = \|X - Z V^T\|^2_F $$
with
$$ V := (v_1, \ldots, v_K) \in \mathbb{R}^{m \times K}, \quad X := (x_1, \ldots, x_n)^T, \quad Z := (z_1, \ldots, z_n)^T $$
Thus PCA is just the (truncated) SVD of the data matrix $X$.

13 / 31

PCA Algorithm

1: procedure DIMRED-PCA($\mathcal{D} := \{x_1, \ldots, x_N\} \subseteq \mathbb{R}^M$, $K \in \mathbb{N}$)
2:     $X := (x_1, x_2, \ldots, x_N)^T$
3:     $(U, \Sigma, V) := \operatorname{svd}(X)$
4:     $Z := U_{\cdot,1:K}\, \Sigma_{1:K,1:K}$
5:     return $\mathcal{D}^{\text{dimred}} := \{Z_{1,\cdot}, \ldots, Z_{N,\cdot}\}$

14 / 31
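A direct numpy transcription of this procedure might look as follows; the toy data set in the usage example is an assumption, and the data is centered first, as required on the "PCA: Task" slide.

```python
# A numpy transcription of the dimred-pca procedure above (the data is assumed
# to be centered already, as required on the "PCA: Task" slide).
import numpy as np

def dimred_pca(D: np.ndarray, K: int) -> np.ndarray:
    """D: array of shape (N, M), one data point per row; returns Z of shape (N, K)."""
    X = D                                              # X := (x_1, ..., x_N)^T
    U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
    Z = U[:, :K] * sigma[:K]                           # Z := U_{.,1:K} Sigma_{1:K,1:K}
    return Z

# usage on a toy data set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X = X - X.mean(axis=0)                                 # center the data
Z = dimred_pca(X, K=2)
print(Z.shape)                                         # (100, 2)
```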

Principal Components Analysis (Example 1)

[Figure 14.15: simulated data in three classes, near ...; HTFF05, p. 530]

15 / 31

Principal Components Analysis (Example 1)

[Figure 14.21: the best rank-two linear approximation to the half-sphere data. The right panel shows the projected points with coordinates given by $U_2 D_2$, the first two principal components of the data. Axes: first vs. second principal component; HTFF05, p. 536]

15 / 31

Principal Components Analysis (Example 2)

[Figure 14.22: a sample of 130 handwritten 3s shows a variety of writing styles; HTFF05, p. 537]

16 / 31

Principal Components Analysis (Example 2)

[Figure 14.23, left panel: the first two principal components of the handwritten threes. The circled points are the closest projected images to the vertices of ...; HTFF05, p. 538]

16 / 31

2. Non-linear Dimensionality Reduction

Outline

2. Non-linear Dimensionality Reduction
3. Supervised Dimensionality Reduction

17 / 31

2. Non-linear Dimensionality Reduction

Linear Dimensionality Reduction

Dimensionality reduction accomplishes two tasks:
1. compute lower dimensional representations for the given data points $x_i$; for PCA:
   $$ u_i = \Sigma^{-1} V^T x_i, \quad U := (u_1, \ldots, u_n)^T $$
2. compute lower dimensional representations for new data points $x$ (often called "fold in"); for PCA:
   $$ u := \operatorname*{arg\,min}_{u} \|x - V \Sigma u\|^2 = \Sigma^{-1} V^T x $$

PCA is called a linear dimensionality reduction technique because the latent representations $u$ depend linearly on the observed representations $x$.

17 / 31
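A small numpy sketch of both tasks for linear PCA, using random toy data (an assumption of the sketch): recovering the latent representations of the training points and folding in a new point via $u = \Sigma^{-1} V^T x$.

```python
# Sketch of the two tasks for (linear) PCA: latent representations for the
# training points and fold-in of a new point x via u = Sigma^{-1} V^T x.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
X = X - X.mean(axis=0)                                 # PCA assumes centered data
K = 3

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
U_K, sigma_K, V_K = U[:, :K], sigma[:K], Vt[:K, :].T

# task 1: latent representations of the training points (rows of U)
U_train = X @ V_K / sigma_K                            # u_i = Sigma^{-1} V^T x_i
print(np.allclose(U_train, U_K))

# task 2: fold-in of a new point x (in practice centered with the training mean)
x_new = rng.normal(size=10)
u_new = (V_K.T @ x_new) / sigma_K                      # u = Sigma^{-1} V^T x
print(u_new)
```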

2. Non-linear Dimensionality Reduction

Kernel Trick

Represent (conceptually) non-linearity by linearity in a higher dimensional embedding
$$ \phi : \mathbb{R}^m \to \mathbb{R}^{\tilde m}, \quad \tilde m \gg m, $$
but compute in the lower dimensionality. This works for methods that depend on $x$ only through a scalar product,
$$ x^T \theta \;\rightsquigarrow\; \phi(x)^T \phi(\theta) = k(x, \theta), \quad x, \theta \in \mathbb{R}^m, $$
if $k$ can be computed without explicitly computing $\phi$.

18 / 31

2. Non-linear Dimensionality Reduction

Kernel Trick / Example

Example:
$$ \phi : \mathbb{R} \to \mathbb{R}^{1001}, \quad x \mapsto \left( \binom{1000}{i}^{\frac{1}{2}} x^i \right)_{i=0,\ldots,1000} = \big( 1,\; 31.62\,x,\; 706.75\,x^2,\; \ldots,\; 31.62\,x^{999},\; x^{1000} \big)^T $$
$$ \phi(x)^T \phi(\theta) = \sum_{i=0}^{1000} \binom{1000}{i} x^i \theta^i = (1 + x \theta)^{1000} =: k(x, \theta) $$

Naive computation: 2002 binomial coefficients, 3003 multiplications, 1000 additions.
Kernel computation: 1 multiplication, 1 addition, 1 exponentiation.

19 / 31
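The equivalence is easy to check numerically. The sketch below uses a smaller degree $d = 10$ instead of $1000$ so that the binomial coefficients stay comfortably inside floating-point range; the degree and the test values are assumptions of the sketch, not part of the slide.

```python
# Numeric check of the kernel trick for the polynomial kernel
# k(x, theta) = (1 + x*theta)^d; the slide uses d = 1000, here d = 10 so all
# intermediate values stay well inside floating-point range (an assumption).
import math
import numpy as np

d = 10

def phi(x: float) -> np.ndarray:
    """Explicit feature map: phi(x)_i = sqrt(C(d, i)) * x^i, i = 0..d."""
    return np.array([math.sqrt(math.comb(d, i)) * x**i for i in range(d + 1)])

def k(x: float, theta: float) -> float:
    """Kernel: one multiplication, one addition, one exponentiation."""
    return (1.0 + x * theta) ** d

x, theta = 0.3, -0.7
print(phi(x) @ phi(theta))   # explicit embedding: builds two (d+1)-dim vectors first
print(k(x, theta))           # same value, computed without phi
```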

2. Non-linear Dimensionality Reduction

Kernel PCA

$$ \phi : \mathbb{R}^m \to \mathbb{R}^{\tilde m}, \quad \tilde X := \begin{pmatrix} \phi(x_1)^T \\ \phi(x_2)^T \\ \vdots \\ \phi(x_n)^T \end{pmatrix}, \quad \tilde X \approx U \Sigma \tilde V^T, \quad \tilde m \gg m $$

We can compute the columns of $U$ as eigenvectors of $\tilde X \tilde X^T \in \mathbb{R}^{n \times n}$ without having to compute $\tilde V \in \mathbb{R}^{\tilde m \times k}$ (which is large!):
$$ \tilde X \tilde X^T U_i = \sigma_i^2 U_i $$

20 / 31

2. Non-linear Dimensionality Reduction

Kernel PCA / Removing the Mean

Issue 1: The $\tilde x_i := \phi(x_i)$ may not have zero mean and thus distort PCA.
$$ \tilde x'_i := \tilde x_i - \frac{1}{n} \sum_{i'=1}^{n} \tilde x_{i'}, \qquad \tilde X' := (\tilde x'_1, \ldots, \tilde x'_n)^T = \big( I - \tfrac{1}{n} \mathbb{1} \big) \tilde X $$
$$ K' := \tilde X' \tilde X'^T = \big( I - \tfrac{1}{n} \mathbb{1} \big) \tilde X \tilde X^T \big( I - \tfrac{1}{n} \mathbb{1} \big) = H K H, \qquad H := \big( I - \tfrac{1}{n} \mathbb{1} \big) \text{ the centering matrix} $$

Thus, the kernel matrix $K'$ with means removed can be computed from the kernel matrix $K$ without having to access coordinates.

Note: $\mathbb{1} := (1)_{i=1,\ldots,n,\, j=1,\ldots,n}$ the matrix of ones, $I := (\delta(i=j))_{i=1,\ldots,n,\, j=1,\ldots,n}$ the identity matrix.

21 / 31
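A short numpy check that $K' = HKH$ indeed equals the kernel matrix of the explicitly centered features; to keep the check computable, the sketch simply uses $\phi = \mathrm{id}$, which is an assumption but sufficient to verify the algebra.

```python
# Check that centering via K' = H K H matches the kernel matrix of explicitly
# centered features (phi = identity here, just to verify the algebra).
import numpy as np

rng = np.random.default_rng(0)
n = 8
Xt = rng.normal(size=(n, 5))                   # rows play the role of phi(x_i)

K = Xt @ Xt.T                                  # kernel matrix K_ij = <phi(x_i), phi(x_j)>
H = np.eye(n) - np.ones((n, n)) / n            # centering matrix H = I - (1/n) * ones
K_centered = H @ K @ H                         # K' = H K H, no coordinates needed

Xt_centered = Xt - Xt.mean(axis=0)             # explicit mean removal in feature space
print(np.allclose(K_centered, Xt_centered @ Xt_centered.T))
```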

2. Non-linear Dimensionality Reduction

Kernel PCA / Fold-In

Issue 2: How to compute projections $u$ of new points $x$ (as $\tilde V$ is not computed)?

With $\tilde V = \tilde X^T U \Sigma^{-1}$:
$$ u := \operatorname*{arg\,min}_{u} \|\phi(x) - \tilde V \Sigma u\|^2 = \Sigma^{-1} \tilde V^T \phi(x) = \Sigma^{-1} \Sigma^{-1} U^T \tilde X \phi(x) = \Sigma^{-2} U^T \big( k(x_i, x) \big)_{i=1,\ldots,n} $$

$u$ can be computed with access to kernel values only (and to $U$, $\Sigma$).

22 / 31

2. Non-linear Dimensionality Reduction

Kernel PCA / Summary

Given: data set $X := \{x_1, \ldots, x_n\} \subseteq \mathbb{R}^m$, kernel function $k : \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}$.

Task 1: Learn latent representations $U$ of the data set $X$:
$$ K := \big( k(x_i, x_j) \big)_{i=1,\ldots,n,\, j=1,\ldots,n} \quad (0) $$
$$ K' := H K H, \quad H := \big( I - \tfrac{1}{n} \mathbb{1} \big) \quad (1) $$
$$ (U, \Sigma) := \text{eigen decomposition}(K') \quad (2) $$

Task 2: Learn the latent representation $u$ of a new point $x$:
$$ u := \Sigma^{-2} U^T \big( k(x_i, x) \big)_{i=1,\ldots,n} \quad (3) $$

23 / 31
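A compact sketch of this recipe in numpy follows; the RBF kernel, its bandwidth, and the toy data are assumptions of the sketch, not prescribed by the slides. Since the eigenvalues of $K'$ are the $\sigma_i^2$, $\Sigma$ is recovered as their square root.

```python
# Sketch of the kernel PCA recipe (steps (0)-(3)) with an RBF kernel (an
# illustrative choice). np.linalg.eigh returns eigenvalues in ascending order,
# so the top K_dims are taken from the end.
import numpy as np

def rbf_kernel(a, b, gamma=1.0):
    """k(a, b) = exp(-gamma * ||a - b||^2)."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_pca_fit(X, K_dims, kernel=rbf_kernel):
    n = len(X)
    K = np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])   # (0)
    H = np.eye(n) - np.ones((n, n)) / n
    K_c = H @ K @ H                                                            # (1)
    eigvals, eigvecs = np.linalg.eigh(K_c)                                     # (2)
    sigma = np.sqrt(np.maximum(eigvals[-K_dims:][::-1], 0.0))                  # sigma_i^2 = top eigenvalues
    U = eigvecs[:, -K_dims:][:, ::-1]
    return U, sigma

def kernel_pca_fold_in(x, X, U, sigma, kernel=rbf_kernel):
    k_x = np.array([kernel(x_i, x) for x_i in X])
    return U.T @ k_x / sigma**2                                                # (3): u = Sigma^{-2} U^T k(., x)

# usage on toy data
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
U, sigma = kernel_pca_fit(X, K_dims=2)
print(kernel_pca_fold_in(rng.normal(size=2), X, U, sigma))
```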

2. Non-linear Dimensionality Reduction

Kernel PCA: Example 1

[Figure: eight kernel PCA eigenvector panels with eigenvalues 22.558, 20.936, 4.648, 3.988, 3.372, 2.956, 2.760, and 2.211; Mur12, p. 493]

24 / 31

2. Non-linear Dimensionality Reduction

Kernel PCA: Example 2

[Figure: PCA projection ("pca"); Mur12, p. 495]

25 / 31

2. Non-linear Dimensionality Reduction

Kernel PCA: Example 2

[Figure: kernel PCA projection ("kpca"); Mur12, p. 495]

25 / 31

3. Supervised Dimensionality Reduction

Outline

2. Non-linear Dimensionality Reduction
3. Supervised Dimensionality Reduction

26 / 31

3. Supervised Dimensionality Reduction

Dimensionality Reduction as Pre-Processing

Given a prediction task and a data set $D^{\text{train}} := \{(x_1, y_1), \ldots, (x_n, y_n)\} \subseteq \mathbb{R}^m \times \mathcal{Y}$:
1. compute latent features $z_i \in \mathbb{R}^K$ for the objects of the data set by means of dimensionality reduction of the predictors $x_i$, e.g., using PCA on $\{x_1, \ldots, x_n\} \subseteq \mathbb{R}^m$,
2. learn a prediction model $\hat y : \mathbb{R}^K \to \mathcal{Y}$ on the latent features, based on the latent data set $\{(z_1, y_1), \ldots, (z_n, y_n)\}$,
3. treat the number $K$ of latent dimensions as a hyperparameter, e.g., find it using grid search (see the sketch below).

26 / 31
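A sketch of this three-step recipe with scikit-learn; the library, the ridge regression predictor, the synthetic data, and the candidate grid for $K$ are all assumptions chosen for illustration, not part of the slides.

```python
# PCA as pre-processing, a predictor on the latent features, and the number K
# of latent dimensions tuned by grid search (scikit-learn is an assumed choice).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))                                     # predictors x_i
y = X[:, :3] @ rng.normal(size=3) + 0.1 * rng.normal(size=200)     # targets y_i

pipeline = Pipeline([("dimred", PCA()),          # step 1: latent features z_i
                     ("predict", Ridge())])      # step 2: model on latent features

search = GridSearchCV(pipeline,                  # step 3: K as a hyperparameter
                      param_grid={"dimred__n_components": [1, 2, 3, 5, 10, 20]},
                      cv=5)
search.fit(X, y)
print(search.best_params_)
```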

3. Supervised Dimensionality Reduction

Dimensionality Reduction as Pre-Processing

Advantages:
- simple procedure
- generic procedure: works with any dimensionality reduction method and any prediction method as component methods
- usually fast

Disadvantages:
- the dimensionality reduction is unsupervised, i.e., not informed about the target that should be predicted later on:
  - it leads to the very same latent features regardless of the prediction task,
  - so likely not the best task-specific features are extracted.

27 / 31

3. Supervised Dimensionality Reduction

Supervised PCA

$$ p(z) := \mathcal{N}(z; 0, I) $$
$$ p(x \mid z; \mu_x, \sigma_x^2, W_x) := \mathcal{N}(x; \mu_x + W_x z, \sigma_x^2 I) $$
$$ p(y \mid z; \mu_y, \sigma_y^2, W_y) := \mathcal{N}(y; \mu_y + W_y z, \sigma_y^2 I) $$

- like two PCAs, coupled by shared latent features $z$: one for the predictors $x$, one for the targets $y$
- the latent features act as an information bottleneck
- also known as Latent Factor Regression or Bayesian Factor Regression

28 / 31

3. Supervised Dimensionality Reduction

Supervised PCA: Discriminative Likelihood

A simple likelihood would put the same weight on reconstructing the predictors and reconstructing the targets. A weight $\alpha \in \mathbb{R}^+_0$ for the reconstruction error of the predictors should be introduced (discriminative likelihood):
$$ L_\alpha(\Theta; x, y, z) := \prod_{i=1}^{n} p(y_i \mid z_i; \Theta)\, p(x_i \mid z_i; \Theta)^\alpha\, p(z_i; \Theta) $$
$\alpha$ can be treated as a hyperparameter and found by grid search.

29 / 31

3. Supervised Dimensionality Reduction

Supervised PCA: EM

The M-steps for $\mu_x, \sigma_x^2, W_x$ and $\mu_y, \sigma_y^2, W_y$ are exactly as before. The coupled E-step is:
$$ z_i = \left( \frac{1}{\sigma_y^2} W_y^T W_y + \alpha \frac{1}{\sigma_x^2} W_x^T W_x \right)^{-1} \left( \frac{1}{\sigma_y^2} W_y^T (y_i - \mu_y) + \alpha \frac{1}{\sigma_x^2} W_x^T (x_i - \mu_x) \right) $$

30 / 31
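For concreteness, the coupled E-step can be written as a small function; the sketch below follows the formula above literally and uses random placeholder parameters only to show the shapes (the surrounding EM loop and the M-steps are omitted).

```python
# A minimal sketch of the coupled E-step from the slide; all parameters here
# are random placeholders to illustrate the shapes, not learned values.
import numpy as np

def coupled_e_step(x, y, mu_x, mu_y, W_x, W_y, sigma2_x, sigma2_y, alpha):
    """Compute z_i for one data point (x_i, y_i) as on the slide."""
    A = W_y.T @ W_y / sigma2_y + alpha * W_x.T @ W_x / sigma2_x
    b = W_y.T @ (y - mu_y) / sigma2_y + alpha * W_x.T @ (x - mu_x) / sigma2_x
    return np.linalg.solve(A, b)

# usage with placeholder parameters (m = 5 predictors, 1 target, K = 2)
rng = np.random.default_rng(0)
m, p, K = 5, 1, 2
z = coupled_e_step(x=rng.normal(size=m), y=rng.normal(size=p),
                   mu_x=np.zeros(m), mu_y=np.zeros(p),
                   W_x=rng.normal(size=(m, K)), W_y=rng.normal(size=(p, K)),
                   sigma2_x=1.0, sigma2_y=1.0, alpha=0.5)
print(z)
```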

3. Supervised Dimensionality Reduction

Conclusion (1/3)

- Dimensionality reduction aims to find a lower dimensional representation of the data that preserves as much information as possible.
- Preserving information means either
  - to preserve pairwise distances between objects (multidimensional scaling), or
  - to be able to reconstruct the original object features (feature reconstruction).
- The truncated Singular Value Decomposition (SVD) provides the best low rank factorization of a matrix into two factor matrices.
- The SVD is usually computed by an algebraic factorization method (such as QR decomposition).

31 / 31

3. Supervised Dimensionality Reduction

Conclusion (2/3)

- Principal components analysis (PCA) finds latent object and variable features that provide the best linear reconstruction (in L2 error).
- PCA is a truncated SVD of the data matrix.
- Probabilistic PCA (PPCA) provides a probabilistic interpretation of PCA:
  - PPCA adds an L2 regularization of the object features,
  - PPCA is learned by the EM algorithm.
- Adding an L2 regularization for the linear reconstruction / variable features on top leads to Bayesian PCA.
- Generalizing to variable-specific variances leads to Factor Analysis.
- For both Bayesian PCA and Factor Analysis, EM can be adapted easily.

31 / 31

3. Supervised Dimensionality Reduction

Conclusion (3/3)

- To capture a nonlinear relationship between latent features and observed features, PCA can be kernelized (Kernel PCA).
- Learning a Kernel PCA is done by an eigen decomposition of the kernel matrix.
- Kernel PCA is often found to lead to unnatural visualizations.
- But Kernel PCA sometimes provides better classification performance for simple classifiers on the latent features (such as 1-Nearest Neighbor).

31 / 31

Readings

- Principal Components Analysis (PCA): [HTFF05], ch. 14.5.1; [Bis06], ch. 12.1; [Mur12], ch. 12.2.
- Probabilistic PCA: [Bis06], ch. 12.2; [Mur12], ch. 12.2.4.
- Factor Analysis: [HTFF05], ch. 14.7.1; [Bis06], ch. 12.2.4.
- Kernel PCA: [HTFF05], ch. 14.5.4; [Bis06], ch. 12.3; [Mur12], ch. 14.4.4.

32 / 31

Further Readings

- (Non-negative) Matrix Factorization: [HTFF05], ch. 14.6.
- Independent Component Analysis, Exploratory Projection Pursuit: [HTFF05], ch. 14.7; [Bis06], ch. 12.4; [Mur12], ch. 12.6.
- Nonlinear Dimensionality Reduction: [HTFF05], ch. 14.9; [Bis06], ch. 12.4.

33 / 31

References

[Bis06] Christopher M. Bishop. Pattern Recognition and Machine Learning, volume 1. Springer, New York, 2006.

[HTFF05] Trevor Hastie, Robert Tibshirani, Jerome Friedman, and James Franklin. The Elements of Statistical Learning: Data Mining, Inference and Prediction, volume 27. 2005.

[Mur12] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. The MIT Press, 2012.

34 / 31