LECTURE NOTE #10 PROF. ALAN YUILLE

1. Principal Component Analysis (PCA)

One way to deal with the curse of dimensionality is to project the data down onto a low-dimensional space, see figure (1).

Figure 1. Projection of high-dimensional data onto a low-dimensional space.

There are a number of different techniques for doing this. Here we discuss the most basic method: Principal Component Analysis (PCA).

2. Convention

µ^T µ is a scalar: µ^T µ = µ_1^2 + µ_2^2 + ... + µ_D^2. µ µ^T is a D × D matrix whose (a, b) entry is µ_a µ_b (first row µ_1^2, µ_1 µ_2, µ_1 µ_3, ...; second row µ_2 µ_1, µ_2^2, ...).

We have data samples x_1, ..., x_N in a D-dimensional space.

Compute the mean µ = (1/N) Σ_{i=1}^N x_i.

Compute the covariance K = (1/N) Σ_{i=1}^N (x_i − µ)(x_i − µ)^T.

Next compute the eigenvalues and eigenvectors of K: solve K e = λ e, giving λ_1 ≥ λ_2 ≥ ... ≥ λ_D.

Note: K is a symmetric matrix, so its eigenvalues are real and its eigenvectors are orthogonal: e_ν · e_µ = 1 if µ = ν, and 0 otherwise. Also, by construction, K is positive semi-definite, so λ_D ≥ 0 (i.e. no eigenvalues are negative).

PCA reduces the dimension by projecting the data onto the space spanned by the eigenvectors e_i with λ_i > T, where T is a threshold. Suppose M eigenvectors are kept. Then project each data point x onto the subspace spanned by the first M eigenvectors, after subtracting out the mean.
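This computation can be written in a few lines of numpy. The following is a minimal sketch of my own (the helper name pca_fit and the array layout are assumptions, not code from the notes):

import numpy as np

def pca_fit(X, M):
    # X: (N, D) array of data samples; M: number of eigenvectors to keep.
    N, D = X.shape
    mu = X.mean(axis=0)                  # mean  mu = (1/N) sum_i x_i
    Xc = X - mu                          # subtract out the mean
    K = (Xc.T @ Xc) / N                  # covariance K = (1/N) sum_i (x_i - mu)(x_i - mu)^T
    lam, E = np.linalg.eigh(K)           # K symmetric: real eigenvalues, orthogonal eigenvectors
    order = np.argsort(lam)[::-1]        # sort eigenvalues in decreasing order
    lam, E = lam[order], E[:, order]
    return mu, lam[:M], E[:, :M]         # keep the first M eigenvalues and eigenvectors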

3. Formally

Express x − µ = Σ_{ν=1}^D a_ν e_ν, where the coefficients {a_ν} are given by a_ν = (x − µ) · e_ν. Orthogonality means e_ν · e_µ = δ_νµ (the Kronecker delta).

Hence x = µ + Σ_{ν=1}^D ((x − µ) · e_ν) e_ν, and there is no dimension reduction (no compression).

Then approximate x ≈ µ + Σ_{ν=1}^M ((x − µ) · e_ν) e_ν. This projects the data into the M-dimensional subspace of points of the form µ + Σ_{ν=1}^M b_ν e_ν.
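Continuing the numpy sketch above (same assumptions; pca_project is a hypothetical helper), the projection coefficients and the M-term approximation of a single point x are:

def pca_project(x, mu, E_M):
    # a_nu = (x - mu) . e_nu ;  approximation  x_hat = mu + sum_nu a_nu e_nu
    a = E_M.T @ (x - mu)      # projection coefficients (length M)
    x_hat = mu + E_M @ a      # approximation in the original D-dimensional space
    return a, x_hat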

4. In Two Dimensions, Visually

Figure 2. In two dimensions, the eigenvectors give the principal axes of the data.

The eigenvectors of K correspond to the second-order moments of the data, see figure (2). If the data lies (almost) on a straight line, then λ_1 > 0 and λ_2 ≈ 0, see figure (3).

Figure 3. In two dimensions, if the data lies along a line then λ_1 > 0 and λ_2 ≈ 0.
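A quick numerical illustration of this picture (my own sketch, not part of the notes): generate 2-D points lying nearly on a line and check that the first eigenvalue dominates the second.

import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=500)
X = np.stack([t, 2.0 * t + 0.05 * rng.normal(size=500)], axis=1)   # points near the line y = 2x
K = np.cov(X.T, bias=True)          # covariance with the 1/N convention
lam = np.linalg.eigvalsh(K)[::-1]
print(lam)                          # lambda_1 large, lambda_2 close to zero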

5. PCA and the Gaussian Distribution

PCA is equivalent to performing maximum-likelihood (ML) estimation of the parameters of a Gaussian

p(x | µ, Σ) = (1 / ((2π)^{D/2} (det Σ)^{1/2})) exp{ −(1/2) (x − µ)^T Σ^{-1} (x − µ) },

obtaining µ̂, Σ̂ by performing ML on Π_i p(x_i | µ, Σ), and then throwing away the directions where the standard deviation is small. ML gives µ̂ = (1/N) Σ_{i=1}^N x_i and Σ̂ = (1/N) Σ_{i=1}^N (x_i − µ̂)(x_i − µ̂)^T. See Bishop's book for probabilistic PCA.

6. Cost Function for PCA

J(M, {a}, {e}) = Σ_{k=1}^N ‖ (µ + Σ_{i=1}^M a_{ki} e_i) − x_k ‖².

Minimize J with respect to M, {a}, {e}, given data {x_k : k = 1, ..., N}. The {a_{ki}} are the projection coefficients.

Intuition: find the M-dimensional subspace such that the projections of the data onto this subspace have minimal error, see figure (4).

Minimizing J gives the {ê_i} to be the eigenvectors of the covariance matrix K = (1/N) Σ_{k=1}^N (x_k − µ̂)(x_k − µ̂)^T, with µ̂ = (1/N) Σ_{k=1}^N x_k and â_{ki} = (x_k − µ̂) · ê_i the projection coefficients.

Figure 4. PCA can be obtained as the projection which minimizes the least-squares error of the residuals.
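To make the cost function concrete, here is a small sketch of my own (pca_cost is a hypothetical helper) that evaluates J for a given mean and set of orthonormal directions, using the optimal coefficients â_{ki} = (x_k − µ) · e_i. For the eigenvector subspace, J equals N times the sum of the discarded eigenvalues λ_{M+1} + ... + λ_D (with the 1/N covariance convention used above).

import numpy as np

def pca_cost(X, mu, E_M):
    # J = sum_k || (mu + sum_i a_ki e_i) - x_k ||^2  with  a_ki = (x_k - mu) . e_i
    Xc = X - mu
    residual = Xc - (Xc @ E_M) @ E_M.T    # component of x_k - mu orthogonal to the subspace
    return np.sum(residual ** 2)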

7. To Understand This Fully, You Must Understand the Singular Value Decomposition (SVD)

We can re-express the criterion as

J[µ, {a}, {e}] = Σ_{k=1}^N Σ_{b=1}^D { (µ_b − x_{bk}) + Σ_{i=1}^M a_{ki} e_{ib} }²,

where b denotes the vector component. This is an example of a general class of problems. Let

E[Ψ, Φ] = Σ_{a=1}^D Σ_{k=1}^N ( x̃_{ak} − Σ_{ν=1}^M Ψ_{aν} Φ_{νk} )².

Goal: minimize E[Ψ, Φ] with respect to Ψ and Φ. This is a bilinear problem, which can be solved by SVD.

Note: x̃_{ak} = x_{ak} − µ_a is the position of the point relative to the mean.

8. Singular Value Decomposition (SVD)

Note: X is not a square matrix (unless D = N), so it has no eigenvalues or eigenvectors. But we can express any D × N matrix X = {x̃_{ak}} in the form

X = E D F,   i.e.   x̃_{ak} = Σ_{µ,ν} e_{aµ} d_{µν} f_{νk},

where D = {d_{µν}} is a diagonal matrix (d_{µν} = 0 for µ ≠ ν) whose diagonal entries are √λ_1, ..., √λ_n, and the {λ_i} are the eigenvalues of X X^T (equivalently of X^T X). The columns of E = {e_{aµ}} are the eigenvectors of (X X^T)_{ab}, and the rows of F = {f_{νk}} are the eigenvectors of (X^T X)_{kl}; µ and ν label the eigenvectors.

Note: for the X defined in the previous section (entries x̃_{ak}, i.e. columns x_k − µ), we get X X^T = Σ_{k=1}^N (x_k − µ)(x_k − µ)^T.
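A small numerical check of these relations (my own sketch; the random matrix is just for illustration, not data from the notes):

import numpy as np

rng = np.random.default_rng(0)
D, N = 3, 6
X = rng.normal(size=(D, N))                        # a D x N matrix (think: mean-centered data)

E, s, Ft = np.linalg.svd(X, full_matrices=False)   # X = E @ diag(s) @ Ft
lam = np.linalg.eigvalsh(X @ X.T)[::-1]            # eigenvalues of X X^T, largest first
print(np.allclose(s ** 2, lam))                    # diagonal entries of D squared = eigenvalues
print(np.allclose(X, E @ np.diag(s) @ Ft))         # X = E D F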

Note: if (X X^T) e = λ e, then (X^T X)(X^T e) = λ (X^T e). This relates the eigenvectors of X X^T and of X^T X (calculate the eigenvectors of the smaller matrix, then deduce those of the bigger one; usually D < N).

9. Minimize E[Ψ, Φ] = Σ_{a=1}^D Σ_{k=1}^N ( x̃_{ak} − Σ_{ν=1}^M Ψ_{aν} Φ_{νk} )².

We set Ψ_{aν} = d_{νν} e^ν_a and Φ_{νk} = f^ν_k, i.e. we take the M biggest terms in the SVD expansion of x̃.

But there is an ambiguity: Σ_{ν=1}^M Ψ_{aν} Φ_{νk} = (Ψ Φ)_{ak} = (Ψ A A^{-1} Φ)_{ak} for any M × M invertible matrix A, so Ψ → Ψ A and Φ → A^{-1} Φ leave the criterion unchanged. For the PCA problem we have the constraint that the projection directions are orthogonal unit eigenvectors; this gets rid of the ambiguity.

10. Relating SVD to PCA (Linear Algebra)

Start with an n × m matrix X. Then X X^T is a symmetric n × n matrix and X^T X is a symmetric m × m matrix (since (X X^T)^T = X X^T).

By standard linear algebra, X X^T e^µ = λ_µ e^µ: there are n eigenvalues λ_µ and eigenvectors e^µ, and the eigenvectors are orthogonal, e^µ · e^ν = δ_µν (= 1 if µ = ν, = 0 if µ ≠ ν).

Similarly, X^T X f^ν = τ_ν f^ν: there are m eigenvalues τ_ν and eigenvectors f^ν, with f^µ · f^ν = δ_µν.

The {e^µ} and {f^ν} are related because (X^T X)(X^T e^µ) = λ_µ (X^T e^µ)

and (X X^T)(X f^µ) = τ_µ (X f^µ).

Hence X^T e^µ ∝ f^µ, X f^µ ∝ e^µ, and λ_µ = τ_µ. If n > m, then there are n eigenvectors {e^µ} but only m eigenvectors {f^µ}, so several of the {e^µ} relate to the same f^µ.

11. Claim

We can express X = Σ_µ α^µ e^µ (f^µ)^T and X^T = Σ_µ α^µ f^µ (e^µ)^T for some α^µ (we will solve for the α^µ below). Then

X f^ν = Σ_µ α^µ e^µ (f^µ)^T f^ν = Σ_µ α^µ δ_µν e^µ = α^ν e^ν.

Verify the claim:

X X^T = Σ_{µ,ν} α^ν e^ν (f^ν)^T α^µ f^µ (e^µ)^T = Σ_{µ,ν} α^ν α^µ e^ν δ_νµ (e^µ)^T = Σ_µ (α^µ)² e^µ (e^µ)^T.

Similarly, X^T X = Σ_µ (α^µ)² f^µ (f^µ)^T. So (α^µ)² = λ_µ (because we can express any symmetric matrix in the form Σ_µ λ_µ e^µ (e^µ)^T, where the λ_µ are its eigenvalues and the e^µ its eigenvectors).

X = Σ_µ α^µ e^µ (f^µ)^T is the SVD of X. In coordinates:

x_{ai} = Σ_µ α^µ e^µ_a f^µ_i = Σ_{µ,ν} e^µ_a α^µ δ_µν f^ν_i,   i.e.   X = E D F with E_{aµ} = e^µ_a, D_{µν} = α^µ δ_µν, F_{νi} = f^ν_i.

12. Effectiveness of PCA

In practice, PCA often reduces the data dimension a lot. But it will not be effective for some problems. For example, suppose the data is a set of strings

(1, 0, 0, 0, ...) = x_1
(0, 1, 0, 0, ...) = x_2
...
(0, 0, 0, ..., 0, 1) = x_N

Then it can be computed that PCA has one zero eigenvalue, but all the other eigenvalues are not small. In general, PCA works best if there is a linear structure to the data. It works poorly if the data lies on a curved surface rather than on a flat surface.

13. Fisher's Linear Discriminant

PCA may not be the best way to reduce the dimension if the goal is discrimination. Suppose you want to discriminate between two classes of data, 1 and 2, shown in figure (5).

Figure 5. This type of data is bad for PCA. Fisher's Linear Discriminant does better if the goal is discrimination.

If you put both sets of data into PCA, you will get the result shown in figure (6). The eigenvectors are e_1, e_2 with eigenvalues λ_1 > λ_2; because of the form of the data, λ_1 ≫ λ_2. The best axis according to PCA (e_1, since λ_1 ≫ λ_2) is the worst direction for discrimination.

Figure 6. The PCA projections for the data in figure (5). The best axis according to PCA is the worst axis for projection if the goal is discrimination.

Projecting the datasets onto e_1 gives the result shown in figure (7):

Figure 7. If we project the data onto e_1, then data 1 and data 2 get all mixed together. Very bad for discrimination.

The second direction, e_2, would be far better. This would give the result shown in figure (8):

Figure 8. We get better discrimination if we project the data onto e_2; then it is easy to separate data 1 from data 2.

14. Fisher's Linear Discriminant gives a way to find a better projection direction.

We have n_1 samples x_i from class X_1 and n_2 samples x_i from class X_2.

Goal: find a vector w and project the data onto this axis (i.e. compute x_i · w) so that the data is well separated.

Define the sample means m_i = (1/n_i) Σ_{x ∈ X_i} x for i = 1, 2.

Define the scatter matrices S_i = Σ_{x ∈ X_i} (x − m_i)(x − m_i)^T for i = 1, 2.

Define the between-class scatter S_B = (m_1 − m_2)(m_1 − m_2)^T between classes X_1 and X_2.

Finally, define the within-class scatter S_W = S_1 + S_2.

15. Now project onto the (unknown) direction w:

m̂_i = (1/n_i) Σ_{x ∈ X_i} w · x = w · m_i,

using the definitions of the sample means.

The means of the projections are the projections of the means. The scatter of the projected points is

Ŝ_i² = Σ_{x ∈ X_i} (w · x − w · m_i)² = w^T S_i w,

by definition of the scatter matrices.

Fisher's criterion: choose the projection direction w to maximize

J(w) = (m̂_1 − m̂_2)² / (Ŝ_1² + Ŝ_2²).

This maximizes the ratio of the between-class distance (m̂_1 − m̂_2) to the within-class scatter.

16. This gives a good projection direction, see figure (9), while other projection directions are bad, see figure (10).

Figure 9. The projection of the data from figure (5) onto the best direction w (at roughly forty-five degrees). This separates the data well because the distance between the projected means m̂_1, m̂_2 is a lot bigger than the projected scatters Ŝ_1, Ŝ_2.

Figure 10. A bad projection direction, by comparison to figure (9). The distance between the projected means m̂_1, m̂_2 is smaller than the projected scatters Ŝ_1, Ŝ_2.

Result: the projection direction that maximizes J(ω) is ω ∝ S_W^{-1}(m_1 − m_2). Note that this is not normalized (i.e. |ω| ≠ 1 in general), so we must normalize the vector.

Proof. Note that the Fisher criterion is independent of the norm of ω, so we can maximize it by arbitrarily requiring that ω^T S_W ω = τ, where τ is a constant. This can be formulated as maximization with a constraint: maximize

ω^T S_B ω − λ(ω^T S_W ω − τ),

where λ is a Lagrange multiplier and τ is a constant. Setting the derivative with respect to ω to zero gives

S_B ω − λ S_W ω = 0,   hence   S_W^{-1} S_B ω = λ ω.

But S_B = (m_1 − m_2)(m_1 − m_2)^T, so S_B ω = ρ(m_1 − m_2) for some scalar ρ (= (m_1 − m_2) · ω). Hence S_B ω̂ ∝ (m_1 − m_2). Recalling that S_W^{-1} S_B ω = λ ω, this implies that ω̂ ∝ S_W^{-1}(m_1 − m_2), and the result follows.

17. Fisher's Linear Discriminant and Gaussian Models

An alternative way to model this problem is to assign a Gaussian model to each dataset (i.e. learn the model parameters µ, Σ for each dataset). If the covariance is the same for both datasets, then the Bayes classifier is a straight line whose normal is the direction ω: the boundary is ω · x + ω_0 = 0 with ω = Σ^{-1}(µ_1 − µ_2). This is exactly the same as Fisher's method! See figure (11). But if the data comes from two Gaussians with different covariances, then the Bayes classifier is a quadratic curve, so it differs from Fisher's linear discriminant.

Figure 11. Try to model the datasets by learning a Gaussian model for each dataset separately. We recover Fisher if both datasets have the same covariance: the decision plane has a surface normal which points along the direction of Fisher's ω. But if the two datasets have different covariances, then the Bayes classifier is a quadratic curve and differs from Fisher.
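As a concrete illustration of the two-class result ω ∝ S_W^{-1}(m_1 − m_2), here is a minimal numpy sketch of my own (the helper name fisher_direction is hypothetical, not from the notes):

import numpy as np

def fisher_direction(X1, X2):
    # X1: (n1, D) samples from class 1;  X2: (n2, D) samples from class 2.
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)   # sample means m_1, m_2
    S1 = (X1 - m1).T @ (X1 - m1)                # scatter matrix S_1
    S2 = (X2 - m2).T @ (X2 - m2)                # scatter matrix S_2
    Sw = S1 + S2                                # within-class scatter S_W
    w = np.linalg.solve(Sw, m1 - m2)            # w proportional to S_W^{-1}(m_1 - m_2)
    return w / np.linalg.norm(w)                # normalize; J(w) is scale-invariant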

18. Multiple Classes

For c classes, compute c − 1 discriminants, i.e. project the D-dimensional features onto a (c − 1)-dimensional space.

Within-class scatter: S_W = S_1 + ... + S_c.

Between-class scatter: S_B = S_total − S_W = Σ_{i=1}^c n_i (m_i − m)(m_i − m)^T, where S_total is the scatter matrix of all the classes together (about the total mean m).

Multiple Discriminant Analysis: seek vectors ω_i, i = 1, ..., c − 1, and project the samples onto the (c − 1)-dimensional space (ω_1 · x, ..., ω_{c−1} · x) = ω^T x.

The criterion is J(ω) = |ω^T S_B ω| / |ω^T S_W ω|, where |·| denotes the determinant.
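A sketch of this multi-class computation in numpy (my own illustration; it uses the eigenvector solution S_B w = λ S_W w stated in the next section, and the helper name multiple_discriminants is hypothetical):

import numpy as np

def multiple_discriminants(class_data):
    # class_data: list of (n_i, D) arrays, one per class.
    means = [X.mean(axis=0) for X in class_data]
    counts = [len(X) for X in class_data]
    m = sum(n * mi for n, mi in zip(counts, means)) / sum(counts)            # total mean
    Sw = sum((X - mi).T @ (X - mi) for X, mi in zip(class_data, means))      # within-class scatter S_W
    Sb = sum(n * np.outer(mi - m, mi - m) for n, mi in zip(counts, means))   # between-class scatter S_B
    vals, vecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)                       # S_W^{-1} S_B w = lambda w
    order = np.argsort(vals.real)[::-1]
    c = len(class_data)
    return vecs[:, order[:c - 1]].real          # the c-1 directions with the largest eigenvalues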

19. The solution is given by the eigenvectors ω whose eigenvalues λ are the c − 1 largest in S_B w = λ S_W w.

Figure 12. w_1 is a good projection direction, w_2 is a bad projection direction.

w_1 is a good projection of the data and w_2 is a bad projection, see figure (12).

Limitations of PCA and Fisher

It is important to realize the limitations of these methods and also why they are popular. They are popular partly because they are easy to implement: both have optimization criteria which can be solved by linear algebra, because the criteria are quadratic and so the solutions are linear. This restriction was necessary when computers did not exist. But now it is possible to use other optimization criteria which can be solved by computers, for example criteria based on nearest-neighbour classification. This will be discussed later.

Also, PCA assumes that the data lies on a low-dimensional linear space. But what if it lies on a low-dimensional curved space? More advanced techniques, e.g. ISOMAP, can deal with this.