PCA & ICA. CE-717: Machine Learning, Sharif University of Technology, Spring 2015, Soleymani


Dimensionality Reduction: Feature Selection vs. Feature Extraction
Feature selection: select a subset of a given feature set, $(x_1, \dots, x_d) \to (x_{i_1}, \dots, x_{i_{d'}})$ with $d' < d$.
Feature extraction: a linear or non-linear transform on the original feature space, $(y_1, \dots, y_{d'}) = f(x_1, \dots, x_d)$.

Feature Extraction
Mapping of the original data to another space. The criterion for feature extraction can differ based on the problem setting:
Unsupervised task: minimize the information loss (reconstruction error).
Supervised task: maximize the class discrimination in the projected space.
Feature extraction algorithms, linear methods:
Unsupervised: e.g., Principal Component Analysis (PCA).
Supervised: e.g., Linear Discriminant Analysis (LDA), also known as Fisher's Discriminant Analysis (FDA).

Feature Extraction
Unsupervised feature extraction: given the $N \times d$ data matrix $X$ (rows $x^{(1)}, \dots, x^{(N)}$), the output is a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data $X'$ ($N \times d'$).
Supervised feature extraction: given $X$ together with the labels $Y = (y^{(1)}, \dots, y^{(N)})$, the output is again a mapping $f: \mathbb{R}^d \to \mathbb{R}^{d'}$, or only the transformed data $X'$.

Unsupervised Feature Reduction
Visualization: projection of high-dimensional data onto 2D or 3D.
Data compression: efficient storage, communication, and retrieval.
Pre-processing: improve accuracy by reducing the number of features; as a preprocessing step for supervised learning tasks it helps avoid overfitting.
Noise removal: e.g., noise in images introduced by minor lighting variations or slightly different imaging conditions.

Linear Transformation
For a linear transformation we find an explicit mapping $f(x) = A^T x$ that can also transform new data vectors: with $A \in \mathbb{R}^{d \times d'}$ and $x \in \mathbb{R}^{d}$, the original data point $x$ is mapped to the reduced data point $x' = A^T x \in \mathbb{R}^{d'}$, where $d' < d$.

Linear Transformation
Linear transformations are simple mappings: $x' = A^T x$, i.e., $x'_j = a_j^T x$ for $j = 1, \dots, d'$, where $A = [a_1, \dots, a_{d'}]$ is the $d \times d'$ matrix with entries $a_{11}, \dots, a_{d d'}$ and columns $a_1, \dots, a_{d'}$.
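As a minimal sketch (not part of the slides), the following NumPy snippet applies such a linear map to a batch of data points; the matrix A here is an arbitrary illustrative choice, not yet one produced by PCA.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_prime, N = 5, 2, 100          # original dim, reduced dim, number of points
X = rng.normal(size=(N, d))        # rows are data points x^(i)

A = rng.normal(size=(d, d_prime))  # columns a_1, ..., a_{d'} define the mapping
X_reduced = X @ A                  # each row is x' = A^T x for the corresponding x

print(X_reduced.shape)             # (100, 2)
```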

Linear Dimensionality Reduction (unsupervised)
Principal Component Analysis (PCA) [we will discuss]
Independent Component Analysis (ICA) [we will discuss]
Singular Value Decomposition (SVD)
Multidimensional Scaling (MDS)
Canonical Correlation Analysis (CCA)

Principal Component Analysis (PCA)
Also known as the Karhunen-Loève (KL) transform.
Principal Components (PCs): orthogonal vectors that are ordered by the fraction of the total information (variation) in the corresponding directions.
PCA finds the directions along which the data approximately lie; when the data is projected onto the first PC, the variance of the projected data is maximized.

Principal Component Analysis (PCA)
Goal: reduce the dimensionality of the data while preserving as much as possible of the variation present in the dataset. PCA finds the best linear subspace, i.e., the one giving the least reconstruction error of the (mean-reduced) data.
The axes are rotated to new (principal) axes such that principal axis 1 has the highest variance, ..., principal axis i has the i-th highest variance, and the principal axes are uncorrelated (the covariance between each pair of principal axes is zero).
The PCs can be found as the top eigenvectors of the covariance matrix of the data points.

Principal Components
If the data has a Gaussian distribution $N(\mu, \Sigma)$, the direction of the largest variance is given by the eigenvector of $\Sigma$ that corresponds to the largest eigenvalue of $\Sigma$. [Figure: 2D data cloud with principal directions $v_1$ and $v_2$.]

Covariance Matrix
$\mu_x = (\mu_1, \dots, \mu_d)^T = (E[x_1], \dots, E[x_d])^T$, and $\Sigma = E[(x - \mu_x)(x - \mu_x)^T]$.
ML estimate of the covariance matrix from data points $\{x^{(i)}\}_{i=1}^N$: with $\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$ and the mean-centered data matrix $\bar{X} = [x^{(1)} - \hat{\mu}, \dots, x^{(N)} - \hat{\mu}]^T$,
$\hat{\Sigma} = \frac{1}{N} \sum_{i=1}^{N} (x^{(i)} - \hat{\mu})(x^{(i)} - \hat{\mu})^T = \frac{1}{N} \bar{X}^T \bar{X}$.
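A small numerical sketch of this estimate (mine, with made-up data); note the $1/N$ factor, whereas numpy.cov defaults to the unbiased $1/(N-1)$ normalization.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))          # N = 200 points in d = 3 dimensions

mu = X.mean(axis=0)                    # sample mean
Xc = X - mu                            # mean-centered data matrix
Sigma_ml = (Xc.T @ Xc) / X.shape[0]    # ML estimate: (1/N) X^T X

# numpy.cov uses 1/(N-1) by default; bias=True switches to the 1/N normalization
assert np.allclose(Sigma_ml, np.cov(X, rowvar=False, bias=True))
```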

PCA: Steps
Input: an $N \times d$ data matrix $X$ (each row contains a $d$-dimensional data point).
1. Compute $\mu = \frac{1}{N} \sum_{i=1}^{N} x^{(i)}$ and subtract it from every row of $X$.
2. Compute the covariance matrix $\Sigma = \frac{1}{N} X^T X$.
3. Compute the eigenvalues and eigenvectors of $\Sigma$.
4. Pick the $d'$ eigenvectors corresponding to the largest eigenvalues and put them in the columns of $A = [v_1, \dots, v_{d'}]$.
5. Project: $X' = X A$; the columns of $X'$ are the first through $d'$-th PCs of the data.
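A compact NumPy sketch of these steps (my own illustration, on synthetic correlated data):

```python
import numpy as np

def pca(X, d_prime):
    """PCA following the steps above: center, covariance, eigendecomposition, project."""
    mu = X.mean(axis=0)
    Xc = X - mu                                  # subtract the mean from every row
    Sigma = (Xc.T @ Xc) / X.shape[0]             # covariance matrix (1/N) X^T X
    eigvals, eigvecs = np.linalg.eigh(Sigma)     # eigh: Sigma is symmetric
    order = np.argsort(eigvals)[::-1]            # sort eigenvalues, largest first
    A = eigvecs[:, order[:d_prime]]              # columns = top d' eigenvectors
    return Xc @ A, A, mu                         # projected data X' = X A

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 10)) @ rng.normal(size=(10, 10))  # correlated toy data
X_proj, A, mu = pca(X, d_prime=3)
print(X_proj.shape)   # (500, 3)
```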

Another Interpretation: Least Squares Error
PCs are linear least-squares fits to the samples, each orthogonal to the previous PCs: the first PC is a minimum-distance fit to a vector in the original feature space; the second PC is a minimum-distance fit to a vector in the plane perpendicular to the first PC.

Another Interpretation: example. [Two figure-only slides.]

Least Squares Error and Maximum Variance Views Are Equivalent (1-dim interpretation)
Minimizing the sum of squared distances to the line (through the origin) is equivalent to maximizing the sum of squares of the projections onto that line (Pythagoras).
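Writing out the Pythagoras step (my own phrasing of the slide's remark): for a unit vector $a$ and mean-centered points $x^{(i)}$,

```latex
\|x^{(i)}\|^2 \;=\; \underbrace{(a^T x^{(i)})^2}_{\text{squared projection}}
  \;+\; \underbrace{\|x^{(i)} - (a^T x^{(i)})\,a\|^2}_{\text{squared distance to the line}},
\qquad\text{so}\qquad
\sum_i \|x^{(i)} - (a^T x^{(i)})\,a\|^2
  \;=\; \sum_i \|x^{(i)}\|^2 \;-\; \sum_i (a^T x^{(i)})^2 .
```

Since $\sum_i \|x^{(i)}\|^2$ is fixed, minimizing the left-hand side over unit vectors $a$ is the same as maximizing $\sum_i (a^T x^{(i)})^2$, the variance of the projections for mean-centered data.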

PCA: Uncorrelated Features
With $x' = A^T x$: $R_{x'} = E[x' x'^T] = E[A^T x x^T A] = A^T E[x x^T] A = A^T R_x A$.
If $A = [a_1, \dots, a_d]$, where $a_1, \dots, a_d$ are orthonormal eigenvectors of $R_x$, then $R_{x'} = A^T R_x A = A^T A \Lambda A^T A = \Lambda$, so $E[x'_i x'_j] = 0$ for $i \neq j$, $i, j = 1, \dots, d$: mutually uncorrelated features are obtained.
Completely uncorrelated features avoid information redundancies.
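A quick numerical check (illustrative, not from the slides) that projecting onto the eigenvectors diagonalizes the covariance, with diagonal entries equal to the eigenvalues (anticipating the eigenvalue-variance relation two slides below):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(2000, 4)) @ rng.normal(size=(4, 4))   # correlated toy data
Xc = X - X.mean(axis=0)
R = (Xc.T @ Xc) / Xc.shape[0]

lam, A = np.linalg.eigh(R)             # orthonormal eigenvectors in the columns of A
Xp = Xc @ A                            # transformed features x' = A^T x
R_prime = (Xp.T @ Xp) / Xp.shape[0]    # covariance of the transformed features

assert np.allclose(R_prime, np.diag(lam), atol=1e-8)   # A^T R A = Lambda
```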

PCA Derivation (Correlation Version): Mean Square Error Approximation
Incorporating all eigenvectors in $A = [a_1, \dots, a_d]$: $x' = A^T x$ and $A x' = A A^T x = x$, so $x = A x'$. If $d' = d$, then $x$ can be reconstructed exactly from $x'$.

PCA Derivation (Correlation Version): Relation between Eigenvalues and Variances
The $j$-th largest eigenvalue of $R_x$ is the variance along the $j$-th PC: $\mathrm{var}(x'_j) = a_j^T R_x a_j = \lambda_j$.

PCA Derivation (Correlation Version): Mean Square Error Approximation
Incorporating only the $d'$ eigenvectors corresponding to the largest eigenvalues, $A = [a_1, \dots, a_{d'}]$ ($d' < d$), minimizes the MSE between $x$ and $\hat{x} = A x'$:
$J(A) = E[\lVert x - \hat{x} \rVert^2] = E\big[\lVert \sum_{j=d'+1}^{d} x'_j a_j \rVert^2\big] = E\big[\sum_{j=d'+1}^{d} \sum_{k=d'+1}^{d} x'_j a_j^T a_k x'_k\big] = \sum_{j=d'+1}^{d} E[x_j'^2] = \sum_{j=d'+1}^{d} a_j^T E[x x^T] a_j = \sum_{j=d'+1}^{d} a_j^T R_x a_j = \sum_{j=d'+1}^{d} \lambda_j$,
the sum of the $d - d'$ smallest eigenvalues.
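An empirical sanity check of this identity (my own, with synthetic data): the average squared reconstruction error when keeping the top $d'$ components matches the sum of the discarded eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(5000, 6)) @ rng.normal(size=(6, 6))
Xc = X - X.mean(axis=0)
R = (Xc.T @ Xc) / Xc.shape[0]

lam, V = np.linalg.eigh(R)
lam, V = lam[::-1], V[:, ::-1]          # sort largest eigenvalue first

d_prime = 2
A = V[:, :d_prime]                      # keep the top d' eigenvectors
X_hat = (Xc @ A) @ A.T                  # reconstruction A x' for every point
mse = np.mean(np.sum((Xc - X_hat) ** 2, axis=1))

print(mse, lam[d_prime:].sum())         # the two numbers agree (up to float error)
```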

PCA Derivation (Correlation Version): Mean Square Error Approximation
In general, it can be shown that this MSE is minimal compared with approximating $x$ in any other $d'$-dimensional orthonormal basis; the result can be obtained without first assuming that the axes are eigenvectors of the correlation matrix.
If the data is mean-centered in advance, $R_x$ and $C_x$ (the covariance matrix) coincide. However, in the correlation version, when $C_x \neq R_x$ the approximation is not, in general, a good one (although it is still a minimum-MSE solution).

PCA on Faces: Eigenfaces
ORL database. [Figure: some sample face images.]

PCA on Faces: Eigenfaces
[Figure: the average face and the 1st through 6th PCs (eigenfaces).] For the eigenfaces, gray = 0, white > 0, black < 0.

PCA on Faces
$x$ is a $112 \times 92 = 10304$-dimensional vector containing the pixel intensities of a face image. The feature vector is $[x'_1, x'_2, \dots, x'_{d'}]$ with $x'_i = \mathrm{PC}_i^T x$, the projection of $x$ onto the $i$-th PC. The image is then represented as the average face plus $x'_1$ times the first eigenface, plus $x'_2$ times the second, ..., plus $x'_{256}$ times the 256th.

PCA on Faces: Reconstructed Face
[Figure: reconstructions using $d' = 1, 2, 4, 8, 16, 32, 64, 128, 256$ components, alongside the original image.]
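The reconstruction experiment can be reproduced along these lines (a sketch under the assumption that `faces` is an N x 10304 array of vectorized ORL images; the file name and loading step are hypothetical, not from the slides):

```python
import numpy as np

# faces: hypothetical (N, 10304) array of vectorized 112x92 ORL face images
faces = np.load("orl_faces.npy")             # assumed pre-prepared file

mean_face = faces.mean(axis=0)
Xc = faces - mean_face
# Eigenvectors of the covariance via SVD of the centered data
# (avoids forming the 10304 x 10304 covariance matrix explicitly).
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
eigenfaces = Vt                              # rows are PCs, ordered by decreasing variance

def reconstruct(x, d_prime):
    """Project a face onto the top d' eigenfaces and map back to pixel space."""
    coeffs = eigenfaces[:d_prime] @ (x - mean_face)       # x'_i = PC_i^T (x - mean)
    return mean_face + eigenfaces[:d_prime].T @ coeffs

for d_prime in [1, 2, 4, 8, 16, 32, 64, 128, 256]:
    x_hat = reconstruct(faces[0], d_prime)   # reconstruction of the first face
```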

Kernel PCA
Kernel extension of PCA: useful when the data (approximately) lies on a lower-dimensional non-linear manifold.
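A minimal sketch using scikit-learn's KernelPCA (scikit-learn is not referenced in the lecture; it simply stands in for the kernelized PCA described here), on data where linear PCA cannot help:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA

# Two concentric circles: not separable by any linear projection
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

X_lin = PCA(n_components=2).fit_transform(X)                       # ordinary (linear) PCA
X_ker = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# In X_ker the two circles become (approximately) linearly separable,
# whereas X_lin is just a rotation of the original data.
```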

PCA and LDA: Drawbacks
PCA drawback: an excellent information-packing transform does not necessarily lead to good class separability; the directions of maximum variance may be useless for classification purposes. [Figure: PCA vs. LDA projection directions on two-class data.]
LDA drawbacks: the singularity or under-sampled problem (when $N < d$), e.g., gene expression data, images, text documents; and LDA can reduce the dimension only to $d' \leq C - 1$ (unlike PCA).

PCA vs. LDA
Although LDA often provides more suitable features for classification tasks, PCA might outperform LDA in some situations:
when the number of samples per class is small (the overfitting problem of LDA);
when the training data non-uniformly samples the underlying distribution;
when the number of desired features is more than $C - 1$.
Advances in the last decade: semi-supervised feature extraction, e.g., PCA+LDA, Regularized LDA, Local Fisher Discriminant Analysis (LFDA).

Independent Component Analysis (ICA)
PCA: the transformed dimensions are uncorrelated with each other; orthogonal linear transform; uses only second-order statistics (i.e., the covariance matrix).
ICA: the transformed dimensions are made as independent as possible; non-orthogonal linear transform; higher-order statistics can also be used.

Uncorrelated and Independent
Uncorrelated: $\mathrm{cov}(X_1, X_2) = 0$. Independent: $P(X_1, X_2) = P(X_1) P(X_2)$.
Gaussian: independent $\Leftrightarrow$ uncorrelated.
Non-Gaussian: independent $\Rightarrow$ uncorrelated, but uncorrelated does not imply independent.
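A tiny numerical illustration (mine, not from the slides) of a non-Gaussian pair that is uncorrelated yet clearly dependent:

```python
import numpy as np

rng = np.random.default_rng(5)
x1 = rng.uniform(-1, 1, size=100_000)
x2 = x1 ** 2                         # completely determined by x1, hence dependent

print(np.cov(x1, x2)[0, 1])          # ~0: uncorrelated, since E[x1^3] = 0 by symmetry
# Yet P(x1, x2) != P(x1) P(x2): knowing x1 fixes x2 exactly.
```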

ICA: Cocktail Party Problem
$d$ speakers are speaking simultaneously, and each microphone records only an overlapping combination of these voices; each microphone records a different combination of the speakers' voices. Using these $d$ microphone recordings, can we separate out the original $d$ speakers' speech signals?
Mixing matrix $A$: $x = A s$. Unmixing matrix $A^{-1}$: $s = A^{-1} x$.
$s_j^{(i)}$: the sound that speaker $j$ was uttering at time $i$; $x_j^{(i)}$: the acoustic reading recorded by microphone $j$ at time $i$.
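A sketch of the cocktail-party setup with simulated sources, using scikit-learn's FastICA as the unmixing algorithm (the lecture derives a maximum-likelihood update instead, shown a few slides below; FastICA is just a convenient stand-in):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(6)
t = np.linspace(0, 8, 2000)
S = np.c_[np.sin(3 * t), np.sign(np.sin(5 * t))]       # two non-Gaussian "speech" sources
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])                             # mixing matrix (unknown in practice)
X = S @ A.T                                            # microphone recordings x = A s

S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
# S_hat recovers the sources up to permutation and scaling
# (the ICA ambiguities discussed on the next slide).
```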

ICA: Ambiguities
We cannot determine the order of the independent components: there is no way to distinguish between $A s$ and $A P^{-1} (P s)$, where $P$ is a permutation matrix.
There is no way to recover the correct scaling of the sources: there is no way to distinguish between $A s$ and $(A / \alpha)(\alpha s)$.
Gaussian distribution $N(0, I)$ of sources: the distribution of any orthonormal transformation of the Gaussian is exactly the same as that Gaussian (see the next slide).

Gaussian Distribution $N(0, I)$ of Sources
Let $x = A s$ and $x' = A' s$ with $A' = A R$, where $R$ is orthonormal ($R R^T = R^T R = I$). Then $E[x' x'^T] = E[A R s s^T R^T A^T] = A R R^T A^T = A A^T = E[x x^T]$.
So there is no way to tell whether the sources were mixed using $A$ or $A'$: there is an arbitrary rotational component that cannot be determined from the data, and thus we cannot recover the original sources.
As long as the data is not Gaussian, it is possible, given enough data, to recover the $d$ independent sources.

ICA: Assumptions
Density of the observations: $p_x(x) = p_s(A^{-1} x)\,|A^{-1}|$.
Assume the sources are independent: $p(s) = \prod_{i=1}^{d} p(s_i)$.
Let $W = A^{-1}$, so $s_i = w_i^T x$ (the rows of $W$ are $w_i^T$).
For the cdf of the sources we choose the sigmoid function $\sigma(s) = 1 / (1 + e^{-s})$ as a reasonable default; hence $p(s_i) = \sigma'(s_i)$.

ICA: Log-Likelihood
$\ell(W) = \sum_{i=1}^{N} \left( \sum_{j=1}^{d} \log \sigma'(w_j^T x^{(i)}) + \log |W| \right)$
Stochastic gradient ascent (using $\nabla_W \log |W| = (W^T)^{-1}$):
$W \leftarrow W + \alpha \left( \begin{bmatrix} 1 - 2\sigma(w_1^T x^{(i)}) \\ \vdots \\ 1 - 2\sigma(w_d^T x^{(i)}) \end{bmatrix} x^{(i)T} + (W^T)^{-1} \right)$
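A direct implementation of this update rule (a sketch under the slide's sigmoid-cdf assumption; the step size and epoch count are arbitrary choices of mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ica_mle(X, alpha=0.01, n_epochs=50, seed=0):
    """Stochastic gradient ascent on the ICA log-likelihood with sigmoid source cdf."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(d, d))                  # unmixing matrix, s = W x
    for _ in range(n_epochs):
        for i in rng.permutation(N):
            x = X[i][:, None]                               # column vector x^(i)
            g = 1.0 - 2.0 * sigmoid(W @ x)                  # 1 - 2*sigma(w_j^T x) for each j
            W += alpha * (g @ x.T + np.linalg.inv(W.T))     # the update from the slide
    return W

# Usage with the mixed microphone signals X from the cocktail-party sketch:
# W = ica_mle(X);  S_hat = X @ W.T   # recovered sources, up to permutation/scaling
```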

Summary
PCA is a linear dimensionality reduction method that finds an orthonormal basis minimizing the mean squared reconstruction error (equivalently, maximizing the retained variance).
ICA finds a linear transformation that makes the components as independent as possible.