Linear Dimensionality Reduction

Hong Chang, Institute of Computing Technology, Chinese Academy of Sciences. Machine Learning Methods (Fall 2012).

Outline
1. Introduction
2. Principal Component Analysis
3. Factor Analysis
4. Multidimensional Scaling
5. Linear Discriminant Analysis

Why Dimensionality Reduction?
The number of inputs (input dimensionality) often affects the time and space complexity of the learning algorithm.
Eliminating an input saves the cost of extracting it.
Simpler models are often more robust on small data sets.
Simpler models are more interpretable, leading to simpler explanations.
Data visualization in 2 or 3 dimensions facilitates the detection of structure and outliers.

Feature Selection vs. Extraction
Feature selection: choosing K < D important features and discarding the remaining D - K (e.g., subset selection algorithms).
Feature extraction: projecting the original D dimensions to K (< D) new dimensions.
Unsupervised methods (without using output information): principal component analysis (PCA), factor analysis (FA), multidimensional scaling (MDS).
Supervised methods (using output information): linear discriminant analysis (LDA).
The linear methods above also have nonlinear extensions.

Principal Component Analysis
PCA finds a linear mapping from the D-dimensional input space to a K-dimensional space (K < D) with minimum information loss according to some criterion.
Projection of x onto the direction of w: z = w^T x.
Finding the first principal component w_1 such that var(z_1) is maximized:
var(z_1) = var(w_1^T x) = E[(w_1^T x - w_1^T µ)^2]
         = E[w_1^T (x - µ)(x - µ)^T w_1]
         = w_1^T E[(x - µ)(x - µ)^T] w_1
         = w_1^T Σ w_1
where cov(x) = E[(x - µ)(x - µ)^T] = Σ.

Optimization Problem for First Principal Component
Maximization of var(z_1) subject to ||w_1|| = 1 can be solved as a constrained optimization problem using a Lagrange multiplier.
Maximization of the Lagrangian: w_1^T Σ w_1 - α(w_1^T w_1 - 1).
Taking the derivative of the Lagrangian w.r.t. w_1 and setting it to 0, we get an eigenvalue equation for the first principal component w_1:
Σ w_1 = α w_1
Because w_1^T Σ w_1 = α w_1^T w_1 = α, we choose the eigenvector with the largest eigenvalue for the variance to be maximum.
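
As a concrete illustration, here is a minimal NumPy sketch (not the lecture's code; the 2-D synthetic data and variable names are only assumptions for the example) that finds the first principal component as the top eigenvector of the sample covariance matrix and checks that var(z_1) equals its eigenvalue.

```python
# Minimal sketch: first principal component via eigendecomposition of the sample covariance.
import numpy as np

rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[3.0, 1.5], [1.5, 1.0]], size=500)  # N x D synthetic data

mu = X.mean(axis=0)                 # sample mean m
Sigma = np.cov(X, rowvar=False)     # sample covariance (D x D)

eigvals, eigvecs = np.linalg.eigh(Sigma)   # ascending eigenvalues, orthonormal eigenvectors
w1 = eigvecs[:, -1]                        # eigenvector with the largest eigenvalue
z1 = (X - mu) @ w1                         # projections z_1 = w_1^T (x - m)

print("largest eigenvalue:", eigvals[-1])
print("var(z_1):         ", z1.var(ddof=1))   # equals the largest eigenvalue of the sample covariance
```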

Optimization Problem for Second Principal Component
The second principal component w_2 should also maximize the variance var(z_2), subject to the constraints that ||w_2|| = 1 and that w_2 is orthogonal to w_1.
Maximization of the Lagrangian: w_2^T Σ w_2 - α(w_2^T w_2 - 1) - β(w_2^T w_1 - 0).
Taking the derivative of the Lagrangian w.r.t. w_2 and setting it to 0, we get the following equation:
2Σ w_2 - 2α w_2 - β w_1 = 0
We can show that β = 0 and hence have the eigenvalue equation
Σ w_2 = α w_2
implying that w_2 is the eigenvector of Σ with the second largest eigenvalue.

What PCA Does
Transformation of the data: z = W^T (x - m), where the columns of W are the eigenvectors of Σ and m is the sample mean.
This centers the data at the origin and rotates the axes. In a 2-D example, if the variance along z_2 is too small, it can be ignored to reduce the dimensionality from 2 to 1.

How to Choose K
Proportion of variance (PoV) explained:
PoV = (λ_1 + λ_2 + ... + λ_K) / (λ_1 + λ_2 + ... + λ_D)
where the λ_i are sorted in descending order.
Typically, stop at PoV > 0.9.
Scree graph: plot PoV against K; stop at the elbow.
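
The PoV rule is straightforward to implement. The sketch below is an illustrative assumption, not the lecture's code: it uses random 5-dimensional data, sorts the eigenvalues, computes the cumulative PoV, picks the smallest K whose PoV exceeds 0.9, and applies z = W^T (x - m).

```python
# Minimal sketch: choosing K by proportion of variance explained, then projecting.
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
X = rng.multivariate_normal(np.zeros(5), A @ A.T, size=1000)   # N x D synthetic data

mu = X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))

lam = eigvals[::-1]                  # eigenvalues in descending order
W = eigvecs[:, ::-1]                 # matching eigenvectors as columns
pov = np.cumsum(lam) / lam.sum()     # PoV for K = 1, ..., D

K = int(np.searchsorted(pov, 0.9)) + 1   # smallest K with PoV >= 0.9
Z = (X - mu) @ W[:, :K]                  # projected data, N x K

print("PoV per K:", np.round(pov, 3), "-> chosen K:", K)
```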

Scree Graph

Scatterplot in Lower-Dimensional Space

Factor Analysis
FA assumes that there is a set of latent factors z_j which, acting in combination, generate the observed variables x. The goal of FA is to characterize the dependency among the observed variables by means of a smaller number of factors.
Problem setting:
Sample X = {x^(i)}: drawn from some unknown probability density with E[x] = µ and cov(x) = Σ.
Factors z_j are unit normals: E[z_j] = 0, var(z_j) = 1, cov(z_j, z_k) = 0 for j ≠ k.
Noise sources ɛ_j: E[ɛ_j] = 0, var(ɛ_j) = Ψ_j, cov(ɛ_j, ɛ_k) = 0 for j ≠ k, cov(ɛ_j, z_k) = 0.
Each of the D input dimensions x_j is expressed as a weighted sum of the K (< D) factors z_k plus a residual error term:
x_j - µ_j = Σ_{k=1}^{K} v_jk z_k + ɛ_j
where the v_jk are called factor loadings. Without loss of generality, we assume that µ = 0.

Relationships between Factors and Input Dimensions
The factors z_k are independent unit normals that are stretched, rotated and translated to generate the inputs x.

PCA vs. FA
The direction of FA is opposite to that of PCA:
PCA (from x to z): z = W^T (x - µ)
FA (from z to x): x - µ = V z + ɛ

Covariance Matrix
Given that var(z_k) = 1 and var(ɛ_j) = Ψ_j:
var(x_j) = Σ_{k=1}^{K} v_jk^2 var(z_k) + var(ɛ_j) = Σ_{k=1}^{K} v_jk^2 + Ψ_j
where the first part (Σ_{k=1}^{K} v_jk^2) is the variance explained by the common factors and the second part (Ψ_j) is the variance specific to x_j.
Covariance matrix, with Ψ = diag(Ψ_j):
Σ = cov(x) = cov(Vz + ɛ) = cov(Vz) + cov(ɛ) = V cov(z) V^T + Ψ = V V^T + Ψ

2-Factor Example
Let x = (x_1, x_2)^T and
V = [ v_11  v_12 ]
    [ v_21  v_22 ]
Since
Σ = [ σ_11  σ_12 ] = V V^T + Ψ = [ v_11  v_12 ] [ v_11  v_21 ] + [ Ψ_1  0   ]
    [ σ_21  σ_22 ]               [ v_21  v_22 ] [ v_12  v_22 ]   [ 0    Ψ_2 ]
we have
σ_12 = cov(x_1, x_2) = v_11 v_21 + v_12 v_22
If x_1 and x_2 have high covariance, then they are related through a factor:
If it is the first factor, then v_11 and v_21 will both be high.
If it is the second factor, then v_12 and v_22 will both be high.
If x_1 and x_2 have low covariance, then they depend on different factors:
In each of the products v_11 v_21 and v_12 v_22, one term will be high and the other will be low.

Factor Loadings
Because
cov(x_1, z_1) = cov(v_11 z_1 + v_12 z_2 + ɛ_1, z_1) = cov(v_11 z_1, z_1) = v_11 var(z_1) = v_11
and similarly
cov(x_1, z_2) = v_12, cov(x_2, z_1) = v_21, cov(x_2, z_2) = v_22,
we have cov(x, z) = V, i.e., the factor loadings V represent the covariances between the variables and the factors.

Dimensionality Reduction
Given S as the estimator of Σ, we want to find V and Ψ such that S = V V^T + Ψ.
If there are only a few factors (i.e., K ≪ D), then we get a simplified structure for S: the number of parameters is reduced from D(D + 1)/2 (for S) to DK + D (for V V^T + Ψ).
Special cases:
Probabilistic PCA (PPCA): Ψ = ψ I (i.e., all Ψ_j are equal)
PCA: Ψ_j = 0
For dimensionality reduction, FA offers no advantage over PCA except the interpretability of the factors, which allows the identification of common causes, a simple explanation, and knowledge extraction.
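
As an illustration of fitting S ≈ V V^T + Ψ, here is a hedged sketch that assumes scikit-learn is available; its FactorAnalysis estimator is one way to fit this model, and the synthetic loadings V_true and specific variances Psi_true below are made up for the example.

```python
# Minimal sketch: fit the FA model and check the covariance reconstruction V V^T + Ψ.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
D, K, N = 6, 2, 2000
V_true = rng.normal(size=(D, K))                 # true loadings (for simulation only)
Psi_true = rng.uniform(0.1, 0.5, size=D)         # true specific variances
Z = rng.normal(size=(N, K))                      # unit-normal factors
X = Z @ V_true.T + rng.normal(size=(N, D)) * np.sqrt(Psi_true)   # x = Vz + ɛ

fa = FactorAnalysis(n_components=K).fit(X)
V_hat = fa.components_.T                         # estimated loadings (D x K)
Psi_hat = fa.noise_variance_                     # estimated specific variances

S = np.cov(X, rowvar=False)
S_model = V_hat @ V_hat.T + np.diag(Psi_hat)     # model covariance V V^T + Ψ
print("max |S - (V V^T + Ψ)|:", np.abs(S - S_model).max())
```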

Multidimensional Scaling
Problem formulation: we are given the pairwise distances between points in some space, but the exact coordinates of the points and their dimensionality are unknown. We want to embed the points in a lower-dimensional space such that the pairwise distances in this space are as close as possible to those in the original space.
The projection to the lower-dimensional space is not unique, because the pairwise distances are invariant to operations such as translation, rotation and reflection.

MDS Embedding of Cities

Derivation
Sample X = {x^(i) ∈ R^D}, i = 1, ..., N.
Squared Euclidean distance between points r and s:
d_rs^2 = ||x^(r) - x^(s)||^2
       = Σ_{j=1}^{D} (x_j^(r) - x_j^(s))^2
       = Σ_{j=1}^{D} (x_j^(r))^2 - 2 Σ_{j=1}^{D} x_j^(r) x_j^(s) + Σ_{j=1}^{D} (x_j^(s))^2
       = b_rr + b_ss - 2 b_rs    (1)
where b_rs = Σ_{j=1}^{D} x_j^(r) x_j^(s) (the dot product of x^(r) and x^(s)), or in matrix form B = X X^T.

Derivation (2)
Centering of the data to constrain the solution: Σ_{i=1}^{N} x_j^(i) = 0 for j = 1, ..., D.
Summing equation (1) over r, over s, and over both r and s, and defining
T = Σ_{i=1}^{N} b_ii = Σ_i Σ_j (x_j^(i))^2
we get
Σ_r d_rs^2 = T + N b_ss
Σ_s d_rs^2 = N b_rr + T
Σ_r Σ_s d_rs^2 = 2NT

Derivation (3)
By defining
d_s^2 = (1/N) Σ_r d_rs^2,   d_r^2 = (1/N) Σ_s d_rs^2,   d^2 = (1/N^2) Σ_r Σ_s d_rs^2
and using equation (1), we get
b_rs = (1/2) (d_r^2 + d_s^2 - d^2 - d_rs^2)
B = X X^T is positive semidefinite, so it can be expressed via its spectral decomposition:
B = C D C^T = C D^{1/2} D^{1/2} C^T = (C D^{1/2})(C D^{1/2})^T
where C is the matrix whose columns are the eigenvectors of B and D^{1/2} is the diagonal matrix whose diagonal elements are the square roots of the eigenvalues.

Projection to Lower-Dimensional Space
If we ignore the eigenvectors of B with very small eigenvalues (the eigenvalues of B = X X^T are the same as the nonzero eigenvalues of X^T X), C D^{1/2} will only be a low-rank approximation of X.
Let c_k be the k eigenvectors chosen, with corresponding eigenvalues λ_k. The new dimensions in the k-dimensional embedding space are
z_k^(i) = sqrt(λ_k) c_k^(i)
So the new coordinates of instance i are given by the ith elements of the eigenvectors after normalization.
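
A compact way to carry out these steps is to double-center the squared-distance matrix, which is equivalent to the b_rs formula above. The sketch below is illustrative only (the 5-D points are made up): it recovers B from pairwise distances, eigendecomposes it, and forms z_k^(i) = sqrt(λ_k) c_k^(i) for a 2-D embedding.

```python
# Minimal sketch: classical MDS from a pairwise squared-distance matrix.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(8, 5))                           # hidden coordinates (N x D)
D2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # squared pairwise distances

N = D2.shape[0]
J = np.eye(N) - np.ones((N, N)) / N                   # centering matrix
B = -0.5 * J @ D2 @ J                                 # b_rs = 1/2 (d_r^2 + d_s^2 - d^2 - d_rs^2)

lam, C = np.linalg.eigh(B)                            # ascending eigenvalues
lam, C = lam[::-1], C[:, ::-1]                        # reorder to descending

k = 2
Z = C[:, :k] * np.sqrt(np.clip(lam[:k], 0, None))     # embedding coordinates (N x k)

# The embedding reproduces the original distances up to the rank-k truncation:
D2_hat = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
print("max distance error:", np.abs(np.sqrt(D2) - np.sqrt(D2_hat)).max())
```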

Linear Discriminant Analysis
Unlike PCA, FA and MDS, LDA is a supervised dimensionality reduction method. LDA is typically used together with a classifier for classification problems.
Goal: project to a low-dimensional space in which the classes are well separated, by utilizing the label (output) information.

Example

2-Class Case
Given a sample X = {(x^(i), y^(i))}, where y^(i) = 1 if x^(i) ∈ C_1 and y^(i) = 0 if x^(i) ∈ C_2.
Find the vector w onto which the data are projected such that the examples from C_1 and C_2 are as well separated as possible.
Projection of x onto w (dimensionality reduced from D to 1): z = w^T x.
m_j ∈ R^D and m_j ∈ R are the sample means of C_j before and after projection, respectively:
m_1 = Σ_i w^T x^(i) y^(i) / Σ_i y^(i) = w^T m_1
m_2 = Σ_i w^T x^(i) (1 - y^(i)) / Σ_i (1 - y^(i)) = w^T m_2

Projection

Between-Class Scatter
Between-class scatter:
(m_1 - m_2)^2 = (w^T m_1 - w^T m_2)^2 = w^T (m_1 - m_2)(m_1 - m_2)^T w = w^T S_B w
where S_B = (m_1 - m_2)(m_1 - m_2)^T.

Within-Class Scatter
Within-class scatter:
s_1^2 = Σ_i (w^T x^(i) - m_1)^2 y^(i) = Σ_i w^T (x^(i) - m_1)(x^(i) - m_1)^T w y^(i) = w^T S_1 w
where S_1 = Σ_i (x^(i) - m_1)(x^(i) - m_1)^T y^(i).
Similarly, s_2^2 = w^T S_2 w with S_2 = Σ_i (x^(i) - m_2)(x^(i) - m_2)^T (1 - y^(i)).
So s_1^2 + s_2^2 = w^T S_W w, where S_W = S_1 + S_2.

Fisher's Linear Discriminant
Fisher's linear discriminant is the vector w that maximizes the Fisher criterion (a.k.a. generalized Rayleigh quotient):
J(w) = (m_1 - m_2)^2 / (s_1^2 + s_2^2) = (w^T S_B w) / (w^T S_W w)
Taking the derivative of J w.r.t. w and setting it to 0, we obtain the generalized eigenvalue problem
S_B w = λ S_W w
or, if S_W is nonsingular, the equivalent eigenvalue problem
S_W^{-1} S_B w = λ w

Fisher's Linear Discriminant (2)
Alternatively, for the 2-class case, we note that
S_B w = (m_1 - m_2)(m_1 - m_2)^T w = c (m_1 - m_2)
for some constant c, and hence S_B w (and also S_W w) is in the same direction as m_1 - m_2. So we get
w = S_W^{-1} (m_1 - m_2) = (S_1 + S_2)^{-1} (m_1 - m_2)
The constant factor is irrelevant and hence is discarded.
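
A minimal sketch of the closed-form 2-class solution w = S_W^{-1}(m_1 - m_2), on synthetic Gaussian classes (the class means and covariance below are assumptions for illustration, not from the lecture).

```python
# Minimal sketch: 2-class Fisher discriminant direction and its criterion value.
import numpy as np

rng = np.random.default_rng(4)
X1 = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.6], [0.6, 1.0]], size=200)  # class C1
X2 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 1.0]], size=200)  # class C2

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S1 = (X1 - m1).T @ (X1 - m1)            # within-class scatter of C1
S2 = (X2 - m2).T @ (X2 - m2)            # within-class scatter of C2
SW = S1 + S2

w = np.linalg.solve(SW, m1 - m2)        # Fisher direction (up to an irrelevant scale)

z1, z2 = X1 @ w, X2 @ w                 # 1-D projections
J = (z1.mean() - z2.mean()) ** 2 / (len(z1) * z1.var(ddof=0) + len(z2) * z2.var(ddof=0))
print("w:", w, " J(w):", J)
```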

K > 2 Classes
Find the matrix W ∈ R^{D×K} such that z = W^T x ∈ R^K.
Within-class scatter matrix for class C_k:
S_k = Σ_i y_k^(i) (x^(i) - m_k)(x^(i) - m_k)^T
where y_k^(i) = 1 if x^(i) ∈ C_k and 0 otherwise.
Total within-class scatter matrix: S_W = Σ_{k=1}^{K} S_k

K > 2 Classes (2)
Between-class scatter matrix:
S_B = Σ_{k=1}^{K} N_k (m_k - m)(m_k - m)^T
where m is the overall mean and N_k = Σ_i y_k^(i).
The optimal solution is the matrix W that maximizes
J(W) = Tr(W^T S_B W) / Tr(W^T S_W W)
whose columns are the eigenvectors of S_W^{-1} S_B with the largest eigenvalues.
Take k ≤ K - 1: since S_B is the sum of K rank-1 matrices and only K - 1 of them are independent, S_B has a maximum rank of K - 1.
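
To close, here is a hedged multi-class sketch (synthetic 3-class, 3-D data invented for the example) that builds S_W and S_B as above and projects onto the leading eigenvectors of S_W^{-1} S_B, of which at most K - 1 are meaningful.

```python
# Minimal sketch: multi-class LDA via the eigenvectors of S_W^{-1} S_B.
import numpy as np

rng = np.random.default_rng(5)
means = [np.array([0.0, 0.0, 0.0]), np.array([3.0, 1.0, 0.0]), np.array([1.0, 4.0, 2.0])]
X = np.vstack([rng.normal(loc=m, scale=1.0, size=(100, 3)) for m in means])
y = np.repeat(np.arange(3), 100)                      # class labels 0, 1, 2

D, classes = X.shape[1], np.unique(y)
m_all = X.mean(axis=0)
SW = np.zeros((D, D))
SB = np.zeros((D, D))
for k in classes:
    Xk = X[y == k]
    mk = Xk.mean(axis=0)
    SW += (Xk - mk).T @ (Xk - mk)                     # within-class scatter S_k
    SB += len(Xk) * np.outer(mk - m_all, mk - m_all)  # N_k (m_k - m)(m_k - m)^T

eigvals, eigvecs = np.linalg.eig(np.linalg.solve(SW, SB))  # eigenvectors of S_W^{-1} S_B
order = np.argsort(eigvals.real)[::-1]
W = eigvecs[:, order[:len(classes) - 1]].real         # keep at most K-1 directions

Z = X @ W                                             # projected data, N x (K-1)
print("leading eigenvalues:", np.round(eigvals.real[order[:2]], 3))
```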