PCA & ICA CE-717: Machine Learning Sharif University of Technology Spring 2015 Soleymani
Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature selection: select a subset of the given feature set, i.e., map [x_1, ..., x_d] to [x_{i_1}, ..., x_{i_{d'}}] with d' < d.
Feature extraction: a linear or non-linear transform on the original feature space, i.e., [y_1, ..., y_{d'}] = f([x_1, ..., x_d]).
Feature Extraction
Mapping of the original data to another space.
The criterion for feature extraction can differ depending on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error).
Supervised task: maximize the class discrimination in the projected space.
Feature extraction algorithms (linear methods):
Unsupervised: e.g., Principal Component Analysis (PCA)
Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA)
Feature Extraction
Unsupervised feature extraction: given the data matrix X = [x^(1), ..., x^(N)]^T (N rows, d columns), learn a mapping f: R^d → R^{d'}, or only the transformed data X' (N rows, d' columns).
Supervised feature extraction: given X together with the labels Y = [y^(1), ..., y^(N)], learn a mapping f: R^d → R^{d'}, or only the transformed data X'.
Unsupervised Feature Reduction
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage, communication, or retrieval.
Pre-processing: reduce the number of features as a preprocessing step for supervised learning tasks, improving accuracy and helping to avoid overfitting.
Noise removal: e.g., noise in images introduced by minor lighting variations or slightly different imaging conditions.
Linear Transformation
For a linear transformation, we find an explicit mapping f(x) = A^T x that can also transform new data vectors.
x' = A^T x, where A ∈ R^{d×d'}, x ∈ R^d, x' ∈ R^{d'}, and d' < d: the original data x is mapped to the reduced data x'.
Linear Transformation
Linear transformations are simple mappings: x' = A^T x, i.e., x'_j = a_j^T x for j = 1, ..., d', where A = [a_1, ..., a_{d'}] is the d×d' matrix with entries a_{11}, ..., a_{dd'}.
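As a quick illustration, here is a minimal NumPy sketch of applying such a transform to a whole data matrix (the matrix A and the sizes are arbitrary stand-ins, not from the slides):

```python
# Project N samples (rows of X) onto d' new axes: row-wise version of x' = A^T x.
import numpy as np

rng = np.random.default_rng(0)
N, d, d_prime = 100, 5, 2

X = rng.normal(size=(N, d))           # original data, one sample per row
A = rng.normal(size=(d, d_prime))     # columns a_1, ..., a_d' define the new axes

X_reduced = X @ A                     # equivalent to applying x' = A^T x to each row
print(X_reduced.shape)                # (100, 2)
```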
Linear Dimensionality Reduction
Unsupervised:
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA) [we will discuss]
Singular Value Decomposition (SVD)
Multidimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)
Principal Component Analysis (PCA)
Also known as the Karhunen-Loève (KL) transform.
Principal Components (PCs): orthogonal vectors ordered by the fraction of the total information (variation) in the corresponding directions.
PCA finds the directions along which the data approximately lie.
When the data is projected onto the first PC, the variance of the projected data is maximized.
Principal Component Analysis (PCA)
Goal: reduce the dimensionality of the data while preserving as much of the variation present in the dataset as possible.
PCA finds the best linear subspace, i.e., the one providing the least reconstruction error of the mean-reduced data.
The axes are rotated to new (principal) axes such that:
Principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance.
The principal axes are uncorrelated: the covariance between each pair of principal axes is zero.
The PCs can be found as the eigenvectors of the covariance matrix of the data points.
Principal Components
If the data has a Gaussian distribution N(μ, Σ), the direction of the largest variance is given by the eigenvector of Σ that corresponds to the largest eigenvalue of Σ.
[Figure: 2-D data scatter with the principal directions v_1 and v_2.]
Covariance Matrix
μ_x = [μ_1, ..., μ_d]^T = [E(x_1), ..., E(x_d)]^T
Σ = E[(x − μ_x)(x − μ_x)^T]
ML estimate of the covariance matrix from the data points {x^(i)}_{i=1}^N:
μ = (1/N) Σ_{i=1}^N x^(i)
Σ = (1/N) Σ_{i=1}^N (x^(i) − μ)(x^(i) − μ)^T = (1/N) X^T X, where X = [x^(1) − μ, ..., x^(N) − μ]^T is the mean-centered data matrix.
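A short sketch of this ML estimate in NumPy (toy data; the 1/N normalization matches the slide, whereas the default sample covariance uses 1/(N−1)):

```python
# ML covariance estimate Sigma = (1/N) X^T X on mean-centered data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))         # N = 200 samples, d = 3 features

mu = X.mean(axis=0)                   # sample mean
Xc = X - mu                           # mean-centered data matrix
Sigma = (Xc.T @ Xc) / X.shape[0]      # ML estimate (1/N normalization)

# np.cov with bias=True uses the same 1/N normalization
assert np.allclose(Sigma, np.cov(X, rowvar=False, bias=True))
```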
PCA: Steps
Input: N×d data matrix X (each row contains a d-dimensional data point).
1. μ = (1/N) Σ_{i=1}^N x^(i); subtract the mean from every row of X.
2. Σ = (1/N) X^T X (covariance matrix).
3. Compute the eigenvalues and eigenvectors of Σ.
4. Pick the d' eigenvectors corresponding to the largest eigenvalues and put them in the columns of A = [v_1, ..., v_{d'}] (from the first PC to the d'-th PC).
5. X' = XA.
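The steps above can be written compactly in NumPy; this is a sketch under the slide's conventions (1/N covariance, eigen-decomposition of Σ), not a production implementation:

```python
# PCA: center the data, form the covariance matrix, keep the top-d' eigenvectors.
import numpy as np

def pca(X, d_prime):
    """Return the d x d' matrix A of principal axes and the projected data X' = X A."""
    mu = X.mean(axis=0)
    Xc = X - mu                               # subtract the mean from each row
    Sigma = (Xc.T @ Xc) / X.shape[0]          # covariance matrix (1/N convention)
    eigvals, eigvecs = np.linalg.eigh(Sigma)  # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]         # sort eigenvalues in decreasing order
    A = eigvecs[:, order[:d_prime]]           # eigenvectors of the d' largest eigenvalues
    return A, Xc @ A

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 10))
A, X_reduced = pca(X, d_prime=3)
print(A.shape, X_reduced.shape)               # (10, 3) (500, 3)
```

In practice an SVD of the centered data matrix is often preferred to forming Σ explicitly, but the eigen-decomposition above follows the slide's steps directly.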
Another Interpretation: Least Squares Error
The PCs are linear least-squares fits to the samples, each orthogonal to the previous PCs:
The first PC is a minimum-distance fit to a vector in the original feature space.
The second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.
Another Interpretation: example
[Figures: samples with the least-squares line fit by the first PC.]
Least Squares Error and Maximum Variance Views Are Equivalent (1-D Interpretation)
Minimizing the sum of squared distances to the line is equivalent to maximizing the sum of squared projections onto that line (Pythagoras): for a mean-centered sample x and a line through the origin, ||x||^2 = (projection onto the line)^2 + (distance to the line)^2.
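A small numerical check of this equivalence on toy 2-D data (the grid of candidate directions is an arbitrary choice for illustration):

```python
# For each unit direction u: ||x||^2 = (u . x)^2 + (distance to the line)^2,
# so the direction maximizing projections also minimizes distances.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])
X = X - X.mean(axis=0)                        # mean-centered samples

angles = np.linspace(0, np.pi, 180, endpoint=False)
sq_proj, sq_dist = [], []
for theta in angles:
    u = np.array([np.cos(theta), np.sin(theta)])
    p = X @ u                                 # projections onto the line
    sq_proj.append(np.sum(p ** 2))
    sq_dist.append(np.sum(np.sum(X ** 2, axis=1) - p ** 2))   # Pythagoras

print(angles[np.argmax(sq_proj)], angles[np.argmin(sq_dist)])  # the same direction
```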
PCA: Uncorrelated Features
x' = A^T x ⇒ R_{x'} = E[x' x'^T] = E[A^T x x^T A] = A^T E[x x^T] A = A^T R_x A
If A = [a_1, ..., a_d], where a_1, ..., a_d are the orthonormal eigenvectors of R_x:
R_{x'} = A^T R_x A = A^T A Λ A^T A = Λ
so E[x'_i x'_j] = 0 for i ≠ j (i, j = 1, ..., d), i.e., mutually uncorrelated features are obtained.
Completely uncorrelated features avoid information redundancies.
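A sketch verifying the decorrelation numerically (mean-centered toy data, so R_x here coincides with the covariance matrix):

```python
# Projecting onto all eigenvectors of the covariance matrix diagonalizes it:
# the covariance of x' = A^T x becomes the diagonal matrix Lambda.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4)) @ rng.normal(size=(4, 4))   # correlated features
Xc = X - X.mean(axis=0)

Sigma = (Xc.T @ Xc) / Xc.shape[0]
eigvals, A = np.linalg.eigh(Sigma)            # orthonormal eigenvectors in columns of A
X_new = Xc @ A                                # x' = A^T x for every sample

Sigma_new = (X_new.T @ X_new) / X_new.shape[0]
print(np.round(Sigma_new, 6))                 # (near-)diagonal matrix Lambda
```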
PCA Derivation (Correlation Version): Mean Square Error Approximation
Incorporating all eigenvectors in A = [a_1, ..., a_d]:
x' = A^T x ⇒ Ax' = AA^T x = x, so x = Ax'.
If d' = d, then x can be reconstructed exactly from x'.
PCA Derivation (Correlation Version): Relation between Eigenvalues and Variances
The j-th largest eigenvalue of R_x is the variance along the j-th PC:
var(x'_j) = a_j^T R_x a_j = λ_j
PCA Derivation (Correlation Version): Mean Square Error Approximation
Incorporating only the d' eigenvectors corresponding to the largest eigenvalues, A' = [a_1, ..., a_{d'}] (d' < d), minimizes the MSE between x and x̂ = A'x':
J(A') = E[||x − x̂||^2] = E[||Σ_{j=d'+1}^{d} x'_j a_j||^2]
= E[Σ_{j=d'+1}^{d} Σ_{k=d'+1}^{d} x'_j a_j^T a_k x'_k] = Σ_{j=d'+1}^{d} E[x'_j^2]
= Σ_{j=d'+1}^{d} a_j^T E[x x^T] a_j = Σ_{j=d'+1}^{d} a_j^T R_x a_j = Σ_{j=d'+1}^{d} λ_j
i.e., the sum of the d − d' smallest eigenvalues.
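A quick check of this identity on toy data (illustrative sizes; the empirical MSE with d' components should match the sum of the discarded eigenvalues up to numerical precision):

```python
# Reconstruction MSE with the top d' PCs equals the sum of the d - d' smallest eigenvalues.
import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(2000, 6)) @ rng.normal(size=(6, 6))
Xc = Xc - Xc.mean(axis=0)                     # mean-centered data

Sigma = (Xc.T @ Xc) / Xc.shape[0]
eigvals, eigvecs = np.linalg.eigh(Sigma)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

d_prime = 2
A = eigvecs[:, :d_prime]
X_hat = (Xc @ A) @ A.T                        # x_hat = A' x' for every sample
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

print(mse, eigvals[d_prime:].sum())           # the two values agree
```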
PCA Derivation (Correlation Version): Mean Square Error Approximation
In general, it can be shown that this MSE is minimal compared to the approximation of x by any other d'-dimensional orthonormal basis, i.e., the result holds without first assuming that the axes are eigenvectors of the correlation matrix.
If the data is mean-centered in advance, R_x and C_x (the covariance matrix) coincide. However, in the correlation version, when C_x ≠ R_x the approximation is not, in general, a good one (although it is still a minimum-MSE solution).
PCA on Faces: Eigenfaces
[Figure: some sample images from the ORL face database.]
PCA on Faces: Eigenfaces
[Figure: the average face and the eigenfaces from the 1st PC to the 6th PC. For eigenfaces, gray = 0, white > 0, black < 0.]
PCA on Faces
x is a 112×92 = 10304-dimensional vector containing the pixel intensities of a face image.
Feature vector: [x'_1, x'_2, ..., x'_{d'}], where x'_i = PC_i^T x is the projection of x onto the i-th PC.
Reconstruction: image ≈ average face + x'_1 PC_1 + x'_2 PC_2 + ... + x'_256 PC_256.
PCA on Faces: Reconstructed Face
[Figure: reconstructions with d' = 1, 2, 4, 8, 16, 32, 64, 128, 256 components, next to the original image.]
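A sketch of this projection/reconstruction pipeline; the random array below is only a stand-in for flattened 112×92 face images (this is not the ORL data or the course code):

```python
# Eigenfaces: SVD of the centered face matrix, then reconstruct from d' coefficients.
import numpy as np

rng = np.random.default_rng(0)
faces = rng.random(size=(400, 10304))         # stand-in for N flattened 112x92 images

mean_face = faces.mean(axis=0)
Fc = faces - mean_face
# With N << d, an SVD of the centered data is cheaper than a 10304 x 10304 covariance.
U, S, Vt = np.linalg.svd(Fc, full_matrices=False)
eigenfaces = Vt                               # rows are the principal components

d_prime = 64
x = faces[0]
coeffs = eigenfaces[:d_prime] @ (x - mean_face)        # x'_i = PC_i^T (x - mean)
x_hat = mean_face + eigenfaces[:d_prime].T @ coeffs    # average face + sum of x'_i PC_i
print(np.linalg.norm(x - x_hat))              # reconstruction error shrinks as d' grows
```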
Kernel PCA
Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear manifold.
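A brief usage sketch, assuming scikit-learn is available (its KernelPCA is one implementation of this idea; the RBF kernel and gamma value are arbitrary choices for the toy data):

```python
# Kernel PCA with an RBF kernel on data lying near a 1-D non-linear manifold (a circle).
import numpy as np
from sklearn.decomposition import KernelPCA

rng = np.random.default_rng(0)
t = rng.uniform(0, 2 * np.pi, size=300)
X = np.column_stack([np.cos(t), np.sin(t)]) + 0.05 * rng.normal(size=(300, 2))

kpca = KernelPCA(n_components=2, kernel="rbf", gamma=2.0)
X_kpca = kpca.fit_transform(X)                # non-linear projection of the data
print(X_kpca.shape)                           # (300, 2)
```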
PCA and LDA: Drawbacks
PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability; the directions of maximum variance may be useless for classification purposes.
[Figure: a 2-D example comparing the PCA and LDA projection directions.]
LDA drawbacks:
Singularity or under-sampled problem (when N < d), e.g., gene expression data, images, text documents.
Can reduce the dimension only to d' ≤ C − 1 (unlike PCA).
PCA vs. LDA
Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
when the number of samples per class is small (overfitting problem of LDA);
when the training data non-uniformly samples the underlying distribution;
when the number of desired features is more than C − 1.
Advances in the last decade: semi-supervised feature extraction, e.g., PCA+LDA, Regularized LDA, Local Fisher Discriminant Analysis (LFDA).
Independent Component Analysis (ICA)
PCA: the transformed dimensions are uncorrelated with each other; orthogonal linear transform; only uses second-order statistics (i.e., the covariance matrix).
ICA: the transformed dimensions are made as independent as possible; non-orthogonal linear transform; higher-order statistics can also be used.
Uncorrelated and Independent
Uncorrelated: cov(X_1, X_2) = 0.
Independent: P(X_1, X_2) = P(X_1)P(X_2).
Gaussian: Independent ⇔ Uncorrelated.
Non-Gaussian: Independent ⇒ Uncorrelated, but Uncorrelated ⇏ Independent.
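A tiny numeric illustration of the gap (a toy example, not from the slides): X uniform on [−1, 1] and Y = X² are uncorrelated but clearly not independent.

```python
# Uncorrelated does not imply independent for non-Gaussian variables.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=100_000)
Y = X ** 2

print(np.corrcoef(X, Y)[0, 1])                # ~0: uncorrelated
# Dependence: conditioning on |X| > 0.5 forces Y > 0.25 with probability 1,
# while the unconditional P(Y > 0.25) is only about 0.5.
print(np.mean(Y[np.abs(X) > 0.5] > 0.25), np.mean(Y > 0.25))
```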
ICA: Cocktail Party Problem
d speakers are speaking simultaneously, and each microphone records a different overlapping combination of these voices.
Using these d microphone recordings, can we separate out the original d speakers' speech signals?
Mixing matrix A: x = As. Unmixing matrix A^{-1}: s = A^{-1}x.
s_j^(i): sound that speaker j was uttering at time i; x_j^(i): acoustic reading recorded by microphone j at time i.
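A sketch of the mixing model with two toy sources (the signals and the matrix A are illustrative; recovering s with the true A^{-1} is only a sanity check, since ICA must estimate the unmixing matrix from x alone):

```python
# Cocktail party model: rows of S are the sources, X = A S are the microphone recordings.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
S = np.vstack([np.sin(3 * t),                 # speaker 1
               np.sign(np.sin(5 * t))])       # speaker 2 (square wave)

A = np.array([[0.7, 0.3],
              [0.4, 0.6]])                    # unknown mixing matrix
X = A @ S                                     # x = A s at every time step

S_recovered = np.linalg.inv(A) @ X            # s = A^{-1} x (using the true A)
print(np.allclose(S, S_recovered))            # True
```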
ICA: Ambiguities
We cannot determine the order of the independent components: there is no way to distinguish between As and AP^{-1}(Ps), where P is a permutation matrix.
There is no way to recover the correct scaling of the sources: there is no way to distinguish between As and (A/α)(αs).
Gaussian sources with distribution N(0, I): any orthonormal transformation of the Gaussian has exactly the same distribution as that Gaussian (see the next slide).
Gaussian Distribution N(0, I) of Sources
Let x = As and x' = A's with A' = AR, where R is orthogonal (RR^T = R^T R = I). Then
E[x' x'^T] = E[A R s s^T R^T A^T] = A R R^T A^T = A A^T = E[x x^T]
so there is no way to tell whether the sources were mixed using A or A': the mixing matrix has an arbitrary rotational component that cannot be determined from the data, and thus we cannot recover the original sources.
As long as the data is not Gaussian, it is possible, given enough data, to recover the d independent sources.
ICA: Assumptions
p_x(x) = p_s(A^{-1}x) |A^{-1}| = p_s(Wx) |W|, where W = A^{-1} and s_i = w_i^T x.
Assume that the sources are independent: p(s) = Π_{i=1}^d p(s_i).
For the CDF of each source we choose the sigmoid function σ(s) = 1/(1 + e^{-s}) as a reasonable default; hence p(s_i) = σ'(s_i).
ICA: Log-Likelihood
ℓ(W) = Σ_{i=1}^N ( Σ_{j=1}^d log σ'(w_j^T x^(i)) + log |W| )
Stochastic gradient ascent (using ∇_W log |W| = (W^T)^{-1}):
W := W + α ( [1 − 2σ(w_1^T x^(i)), ..., 1 − 2σ(w_d^T x^(i))]^T (x^(i))^T + (W^T)^{-1} )
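A compact sketch of this update rule (sigmoid-CDF assumption as above; the step size, initialization, and epoch count are arbitrary choices, not part of the derivation):

```python
# Maximum-likelihood ICA with stochastic gradient ascent on l(W).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica(X, alpha=0.01, n_epochs=20, seed=0):
    """X: N x d matrix of mixed samples (one x^(i) per row). Returns the unmixing matrix W."""
    rng = np.random.default_rng(seed)
    N, d = X.shape
    W = np.eye(d) + 0.1 * rng.normal(size=(d, d))      # start near the identity
    for _ in range(n_epochs):
        for i in rng.permutation(N):                   # one stochastic update per sample
            x = X[i]
            g = 1.0 - 2.0 * sigmoid(W @ x)             # vector of 1 - 2*sigma(w_j^T x^(i))
            W += alpha * (np.outer(g, x) + np.linalg.inv(W.T))   # gradient step on l(W)
    return W

# Usage with the cocktail-party sketch (X there is d x N, so pass the transpose):
# W = ica(X.T); S_hat = X.T @ W.T   # recovered sources, up to permutation and scaling
```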
Summary
PCA is a linear dimensionality reduction method that finds an orthonormal basis minimizing the reconstruction MSE (equivalently, maximizing the variance of the projected data).
ICA finds a linear transformation that makes the components as independent as possible.