EECS 275 Matrix Computation
Ming-Hsuan Yang
Electrical Engineering and Computer Science
University of California at Merced
Merced, CA 95344
http://faculty.ucmerced.edu/mhyang
Lecture 6

Overview
Orthogonal projection, distance between subspaces
Principal component analysis

Reading
Chapter 6 of Numerical Linear Algebra by Lloyd Trefethen and David Bau
Chapter 2 of Matrix Computations by Gene Golub and Charles Van Loan
Chapter 5 of Matrix Analysis and Applied Linear Algebra by Carl Meyer

Orthogonal projection
Let $S \subseteq \mathbb{R}^n$ be a subspace. $P \in \mathbb{R}^{n \times n}$ is the orthogonal projection (i.e., projector) onto $S$ if $\mathrm{ran}(P) = S$, $P^2 = P$, and $P^\top = P$.
Mathematically, if $y = Px$ for some $x$, then $Py = P^2 x = Px = y$.
Example, in $\mathbb{R}^3$:
$P = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad P \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} x \\ y \\ 0 \end{bmatrix}, \quad \text{and} \quad P^2 \begin{bmatrix} x \\ y \\ z \end{bmatrix} = \begin{bmatrix} x \\ y \\ 0 \end{bmatrix}$
For a projector, $P(Px - x) = P^2 x - Px = 0$, which means $Px - x \in \mathrm{null}(P)$.
If $x \in \mathbb{R}^n$, then $Px \in S$ and $(I - P)x \in S^{\perp}$.
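A minimal NumPy sketch (not part of the original slides) of the $\mathbb{R}^3$ example above:

```python
import numpy as np

# Orthogonal projector onto the xy-plane in R^3 (the example above).
P = np.diag([1.0, 1.0, 0.0])

x = np.array([3.0, -2.0, 5.0])
print(P @ x)                 # [ 3. -2.  0.]  -> Px lies in S
print(P @ (P @ x))           # [ 3. -2.  0.]  -> P^2 x = P x (idempotent)
print((np.eye(3) - P) @ x)   # [ 0.  0.  5.]  -> (I - P)x lies in S^perp
print(np.allclose(P, P.T), np.allclose(P @ P, P))   # True True
```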

Orthogonal projection
If $P$ is a projector, $I - P$ is also a projector, since $(I - P)^2 = I - 2P + P^2 = I - P$.
The matrix $I - P$ is called the complementary projector to $P$.
$I - P$ projects onto the null space of $P$, i.e., $\mathrm{ran}(I - P) = \mathrm{null}(P)$, and, since $P = I - (I - P)$, we have $\mathrm{null}(I - P) = \mathrm{ran}(P)$ and $\mathrm{ran}(P) \cap \mathrm{null}(P) = \{0\}$.
If $P_1$ and $P_2$ are orthogonal projections, then for any $z \in \mathbb{R}^n$ we have
$\|(P_1 - P_2)z\|_2^2 = (P_1 z)^\top (I - P_2)z + (P_2 z)^\top (I - P_1)z$
If $\mathrm{ran}(P_1) = \mathrm{ran}(P_2) = S$, then the right-hand side of the above equation is zero, i.e., the orthogonal projection onto a subspace is unique.

Orthogonal projection and SVD
If the columns of $V = [v_1, \ldots, v_k]$ are an orthonormal basis for a subspace $S$, then it is easy to show that $P = VV^\top$ is the unique orthogonal projection onto $S$.
If $v \in \mathbb{R}^n$, then $P = \dfrac{vv^\top}{v^\top v}$ is the orthogonal projection onto $S = \mathrm{span}(\{v\})$.
Let $A = U \Sigma V^\top \in \mathbb{R}^{m \times n}$ with $\mathrm{rank}(A) = r$, and partition $U$ and $V$ as
$U = [\, U_r \mid \tilde{U}_r \,], \quad V = [\, V_r \mid \tilde{V}_r \,]$
where $U_r$ has $r$ columns, $\tilde{U}_r$ has $m - r$ columns, $V_r$ has $r$ columns, and $\tilde{V}_r$ has $n - r$ columns. Then
$U_r U_r^\top$ = projection onto $\mathrm{ran}(A)$
$\tilde{U}_r \tilde{U}_r^\top$ = projection onto $\mathrm{ran}(A)^{\perp} = \mathrm{null}(A^\top)$
$V_r V_r^\top$ = projection onto $\mathrm{null}(A)^{\perp} = \mathrm{ran}(A^\top)$
$\tilde{V}_r \tilde{V}_r^\top$ = projection onto $\mathrm{null}(A)$
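A short NumPy sketch (not from the slides; the rank-2 matrix A below is a made-up example) of forming these four projectors from the SVD:

```python
import numpy as np

# A made-up rank-2 matrix in R^{4x3} for illustration.
A = np.array([[1., 2., 3.],
              [2., 4., 6.],
              [1., 0., 1.],
              [0., 1., 1.]])

U, s, VT = np.linalg.svd(A)
r = int(np.sum(s > 1e-10))          # numerical rank

Ur, Util = U[:, :r], U[:, r:]       # U = [U_r | U~_r]
Vr, Vtil = VT[:r].T, VT[r:].T       # V = [V_r | V~_r]

P_ranA   = Ur @ Ur.T                # projection onto ran(A)
P_nullAT = Util @ Util.T            # projection onto null(A^T) = ran(A)^perp
P_ranAT  = Vr @ Vr.T                # projection onto ran(A^T) = null(A)^perp
P_nullA  = Vtil @ Vtil.T            # projection onto null(A)

# Complementary projectors sum to the identity.
print(np.allclose(P_ranA + P_nullAT, np.eye(A.shape[0])))   # True
print(np.allclose(P_ranAT + P_nullA, np.eye(A.shape[1])))   # True
```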

Distances between subspaces
Let $S_1$ and $S_2$ be subspaces of $\mathbb{R}^n$ with $\dim(S_1) = \dim(S_2)$. We define the distance between the two spaces by
$\mathrm{dist}(S_1, S_2) = \|P_1 - P_2\|_2$
where $P_i$ is the orthogonal projection onto $S_i$.
The distance between a pair of subspaces can be characterized in terms of the blocks of a certain orthogonal matrix.
Theorem. Suppose $W = [\, W_1 \mid W_2 \,]$ and $Z = [\, Z_1 \mid Z_2 \,]$ are $n$-by-$n$ orthogonal matrices, where $W_1, Z_1$ have $k$ columns and $W_2, Z_2$ have $n - k$ columns. If $S_1 = \mathrm{ran}(W_1)$ and $S_2 = \mathrm{ran}(Z_1)$, then
$\mathrm{dist}(S_1, S_2) = \|W_1^\top Z_2\|_2 = \|Z_1^\top W_2\|_2$
See Golub and Van Loan for the proof.
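A NumPy sketch (not from the slides; the random subspaces are an illustrative choice) comparing the definition with the theorem's block formula:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 6, 2

# Two random k-dimensional subspaces of R^n, given by orthonormal bases.
W1, _ = np.linalg.qr(rng.standard_normal((n, k)))
Z1, _ = np.linalg.qr(rng.standard_normal((n, k)))

P1, P2 = W1 @ W1.T, Z1 @ Z1.T

# Definition: dist(S1, S2) = ||P1 - P2||_2 (spectral norm).
d_def = np.linalg.norm(P1 - P2, 2)

# Theorem: complete W1, Z1 to orthogonal matrices; W2, Z2 span the orthogonal complements.
W2 = np.linalg.svd(np.eye(n) - P1)[0][:, :n - k]
Z2 = np.linalg.svd(np.eye(n) - P2)[0][:, :n - k]
d_thm = np.linalg.norm(W1.T @ Z2, 2)

print(d_def, d_thm)   # the two values agree up to rounding
```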

Distance between subspaces in $\mathbb{R}^n$
If $S_1$ and $S_2$ are subspaces of $\mathbb{R}^n$ with the same dimension, then $0 \le \mathrm{dist}(S_1, S_2) \le 1$.
The distance is zero if $S_1 = S_2$ and one if $S_1 \cap S_2^{\perp} \neq \{0\}$.

Symmetric matrices
Consider real, symmetric matrices, $A^\top = A$, e.g.,
the Hessian matrix $H$ (second-order partial derivatives of a function):
$y = f(x + \Delta x) \approx f(x) + J(x)\, \Delta x + \tfrac{1}{2}\, \Delta x^\top H(x)\, \Delta x$
where $J$ is the Jacobian matrix;
the covariance matrix of a Gaussian distribution.
The inverse is also symmetric: $(A^{-1})^\top = A^{-1}$.
The eigenvector equation for a symmetric matrix is $A u_k = \lambda_k u_k$, which can be written as $AU = UD$, or $(A - \lambda_k I) u_k = 0$ for each $k$, where $D$ is a diagonal matrix whose elements are the eigenvalues,
$D = \mathrm{diag}(\lambda_1, \ldots, \lambda_m)$
and $U$ is the matrix whose columns are the eigenvectors $u_k$.

Eigenvectors for symmetric matrices
The eigenvalues can be computed from the characteristic equation $\det(A - \lambda I) = 0$.
The eigenvectors can be chosen to form an orthonormal basis, as follows.
For a pair of eigenvectors $u_j$ and $u_k$, it follows that
$u_j^\top A u_k = \lambda_k\, u_j^\top u_k$
$u_k^\top A u_j = \lambda_j\, u_k^\top u_j$
and since $A$ is symmetric, subtracting the two equations gives
$(\lambda_k - \lambda_j)\, u_k^\top u_j = 0$
For $\lambda_k \neq \lambda_j$, the eigenvectors must be orthogonal.
Note that for any $u_k$ with eigenvalue $\lambda_k$, $\beta u_k$ is also an eigenvector with the same eigenvalue for any nonzero $\beta$.
This can be used to normalize the eigenvectors to unit norm so that $u_k^\top u_j = \delta_{kj}$.

Symmetric matrices and diagonalization
Since $A u_k = \lambda_k u_k$, multiplying by $A^{-1}$ (when $A$ is invertible) we obtain $A^{-1} u_k = \lambda_k^{-1} u_k$, so $A^{-1}$ has the same eigenvectors as $A$ but with reciprocal eigenvalues.
For a symmetric matrix $A$ with $AU = UD$, $U^\top U = I$, and $U = [u_1, \ldots, u_m]$, $A$ can be diagonalized:
$U^\top A U = D$
For a symmetric matrix $A$, consider the SVD $A = U \Sigma V^\top$. Recall that $U$, $V$ are the left and right singular vectors:
$(AA^\top)\, U = U \Sigma^2, \qquad (A^\top A)\, V = V \Sigma^2$
Since $A$ is symmetric (and, e.g. for a covariance matrix, positive semidefinite), $U = V$ and $A = U \Sigma U^\top$.
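A small NumPy check (not part of the slides; the matrix A is a made-up symmetric PSD example) that the eigendecomposition and the SVD coincide in this case:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
A = X.T @ X                        # symmetric positive semidefinite 3x3 matrix

# Eigendecomposition: A U = U D with orthonormal U.
evals, U = np.linalg.eigh(A)       # eigh is for symmetric/Hermitian matrices
print(np.allclose(A @ U, U @ np.diag(evals)))       # True
print(np.allclose(U.T @ A @ U, np.diag(evals)))     # True

# For a symmetric PSD matrix the SVD matches the eigendecomposition
# (up to ordering/sign of the vectors): A = U Sigma U^T.
Us, s, VT = np.linalg.svd(A)
print(np.allclose(np.sort(s), np.sort(evals)))      # singular values = eigenvalues
print(np.allclose(Us @ np.diag(s) @ VT, A))         # True
```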

Principal component analysis (PCA)
Arguably the most popular dimensionality reduction algorithm; helps mitigate the curse of dimensionality.
Widely used in computer vision, machine learning, and pattern recognition.
Can be derived from several perspectives:
minimizing reconstruction error: Karhunen-Loeve transform
decorrelation: Hotelling transform
maximizing the variance of the projected samples (i.e., preserving as much energy as possible)
Unsupervised learning, linear transform, based on second-order statistics.
Recall from the SVD we have $A = U \Sigma V^\top$, and thus projecting the samples onto the subspace spanned by $U$ can be computed as $U^\top A = \Sigma V^\top$ (see the sketch below).
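A hedged NumPy sketch (the data matrix A below is made-up random data, with samples stored as columns) of the SVD view of the projection:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical data: n = 200 samples in R^5, stored as the columns of A.
A = rng.standard_normal((5, 200))
A = A - A.mean(axis=1, keepdims=True)      # center each feature

U, s, VT = np.linalg.svd(A, full_matrices=False)

# Projecting all samples onto the subspace spanned by U: U^T A = Sigma V^T.
Z_full = U.T @ A
print(np.allclose(Z_full, np.diag(s) @ VT))    # True

# Keeping only the first d left singular vectors gives the d-dimensional projection.
d = 2
Z = U[:, :d].T @ A                             # d x n matrix of principal components
print(Z.shape)                                 # (2, 200)
```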

Principal component analysis
Given a set of $n$ data points $x^{(k)} \in \mathbb{R}^m$, we would like to project each $x^{(k)}$ onto a $d$-dimensional subspace, $z^{(k)} = [z_1, \ldots, z_d]^\top \in \mathbb{R}^d$, $d < m$.
In an orthonormal basis $\{u_i\}_{i=1}^m$, each $x$ can be written exactly as
$x = \sum_{i=1}^{m} z_i u_i$
where the vectors $u_i$ satisfy the orthonormality relation $u_i^\top u_j = \delta_{ij}$, in which $\delta_{ij}$ is the Kronecker delta. Thus $z_i = u_i^\top x$.
Now we retain only a subset $d < m$ of the basis vectors $u_i$. The remaining coefficients are replaced by constants $b_i$, so that each vector $x$ is approximated by
$\tilde{x} = \sum_{i=1}^{d} z_i u_i + \sum_{i=d+1}^{m} b_i u_i$

Principal component analysis (cont'd)
Dimensionality reduction: $x$ has $m$ degrees of freedom and $z$ has $d$ degrees of freedom, $d < m$.
For each $x^{(k)}$, the error introduced by the dimensionality reduction is
$x^{(k)} - \tilde{x}^{(k)} = \sum_{i=d+1}^{m} \left( z_i^{(k)} - b_i \right) u_i$
and we want to find the basis vectors $u_i$, the coefficients $b_i$, and the values $z_i$ that minimize the error in the $\ell_2$-norm.
For the whole data set, using the orthonormality relation,
$E_d = \frac{1}{2} \sum_{k=1}^{n} \| x^{(k)} - \tilde{x}^{(k)} \|^2 = \frac{1}{2} \sum_{k=1}^{n} \sum_{i=d+1}^{m} \left( z_i^{(k)} - b_i \right)^2$

Principal component analysis (cont'd)
Taking the derivative of $E_d$ with respect to $b_i$ and setting it to zero,
$b_i = \frac{1}{n} \sum_{k=1}^{n} z_i^{(k)} = \frac{1}{n} \sum_{k=1}^{n} u_i^\top x^{(k)} = u_i^\top \bar{x}, \qquad \text{where } \bar{x} = \frac{1}{n} \sum_{k=1}^{n} x^{(k)}$
Plugging this into the sum of squared errors $E_d$,
$E_d = \frac{1}{2} \sum_{i=d+1}^{m} \sum_{k=1}^{n} \left( u_i^\top (x^{(k)} - \bar{x}) \right)^2 = \frac{n}{2} \sum_{i=d+1}^{m} u_i^\top C u_i$
where $C$ is the covariance matrix
$C = \frac{1}{n} \sum_{k=1}^{n} (x^{(k)} - \bar{x})(x^{(k)} - \bar{x})^\top$
Minimizing $E_d$ with respect to $u_i$, we get $C u_i = \lambda_i u_i$, i.e., the basis vectors $u_i$ are the eigenvectors of the covariance matrix $C$.
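A NumPy sketch (made-up data; not part of the slides) that carries out this recipe and checks the resulting minimum error against the eigenvalues of $C$:

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical data set: n = 500 samples x^(k) in R^4, stored as the rows of X.
X = rng.standard_normal((500, 4)) @ np.diag([3.0, 2.0, 0.5, 0.1])

n, m = X.shape
x_bar = X.mean(axis=0)
Xc = X - x_bar                           # x^(k) - x_bar

C = Xc.T @ Xc / n                        # covariance matrix C
evals, U = np.linalg.eigh(C)             # C u_i = lambda_i u_i (ascending order)
evals, U = evals[::-1], U[:, ::-1]       # largest eigenvalues first

d = 2
Ud = U[:, :d]                            # retained eigenvectors u_1, ..., u_d
b = U[:, d:].T @ x_bar                   # optimal constants b_i = u_i^T x_bar

# Reconstruct x~^(k) = sum_{i<=d} z_i u_i + sum_{i>d} b_i u_i and measure the error.
X_tilde = (X @ Ud) @ Ud.T + b @ U[:, d:].T
E_d = 0.5 * np.sum((X - X_tilde) ** 2)

# The minimum error equals (n/2) times the sum of the discarded eigenvalues.
print(E_d, 0.5 * n * evals[d:].sum())    # the two numbers agree
```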

Derivation
Minimizing $E_d$ with respect to $u_i$:
$E_d = \frac{1}{2} \sum_{i=d+1}^{m} \sum_{k=1}^{n} \left( u_i^\top (x^{(k)} - \bar{x}) \right)^2 = \frac{n}{2} \sum_{i=d+1}^{m} u_i^\top C u_i$
We need some constraints to solve this optimization problem: impose orthonormality constraints among the $u_i$.
Using Lagrange multipliers $\phi_{ij}$ (and dropping the constant factor $n$, which does not affect the minimizer),
$\hat{E}_d = \frac{1}{2} \sum_{i=d+1}^{m} u_i^\top C u_i - \frac{1}{2} \sum_{i=d+1}^{m} \sum_{j=d+1}^{m} \phi_{ij} \left( u_i^\top u_j - \delta_{ij} \right)$
Recall: for $\min f(x)$ s.t. $g(x) = 0$, form the Lagrangian $L(x, \phi) = f(x) + \phi\, g(x)$.
Example: $\min f(x_1, x_2) = x_1 x_2$ subject to $g(x_1, x_2) = x_1 + x_2 - 1 = 0$ (worked out below).
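A short worked version of the small example above (a sketch; Lagrange multipliers locate the constrained stationary point):
$L(x_1, x_2, \phi) = x_1 x_2 + \phi (x_1 + x_2 - 1)$
$\frac{\partial L}{\partial x_1} = x_2 + \phi = 0, \quad \frac{\partial L}{\partial x_2} = x_1 + \phi = 0, \quad \frac{\partial L}{\partial \phi} = x_1 + x_2 - 1 = 0$
which gives $x_1 = x_2 = \tfrac{1}{2}$ and $\phi = -\tfrac{1}{2}$, the unique constrained stationary point of $f$ on the line $x_1 + x_2 = 1$.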

Derivation (cont'd)
In matrix form,
$\hat{E}_d = \frac{1}{2} \mathrm{tr}\{ U^\top C U \} - \frac{1}{2} \mathrm{tr}\{ M (U^\top U - I) \}$
where $M$ is the matrix with elements $\phi_{ij}$ and $U$ is the matrix whose columns are the $u_i$.
Minimizing $\hat{E}_d$ with respect to $U$,
$(C + C^\top) U - U (M + M^\top) = 0$
Note that $C$ is symmetric, and $M$ can be taken to be symmetric since $U^\top U$ is symmetric. Thus
$C U = U M, \qquad U^\top C U = M$
Clearly one solution is to choose $M$ to be diagonal, so that the columns of $U$ are eigenvectors of $C$ and the diagonal elements of $M$ are eigenvalues.

Derivation (cont'd)
The eigenvector equation for $M$ is
$M \Psi = \Psi \Lambda$
where $\Lambda$ is a diagonal matrix of eigenvalues and $\Psi$ is the matrix of eigenvectors.
$M$ is symmetric, so $\Psi$ can be chosen to have orthonormal columns, i.e., $\Psi^\top \Psi = I$ and $\Lambda = \Psi^\top M \Psi$.
Putting these together,
$\Lambda = \Psi^\top U^\top C U \Psi = (U \Psi)^\top C (U \Psi) = \tilde{U}^\top C \tilde{U}$
where $\tilde{U} = U \Psi$, i.e., $U = \tilde{U} \Psi^\top$.
Another solution of $U^\top C U = M$ can therefore be obtained from the particular solution $\tilde{U}$ by application of an orthogonal transformation given by $\Psi$.

Derivation (cont'd)
We note that $E_d$ is invariant under this orthogonal transformation:
$E_d = \frac{1}{2} \mathrm{tr}\{ U^\top C U \} = \frac{1}{2} \mathrm{tr}\{ \Psi \tilde{U}^\top C \tilde{U} \Psi^\top \} = \frac{1}{2} \mathrm{tr}\{ \tilde{U}^\top C \tilde{U} \}$
Recall that the matrix 2-norm is invariant under orthogonal transformations.
Since all of the possible solutions give the same minimum error $E_d$, we can choose whichever is most convenient.
We thus choose the solution given by $\tilde{U}$ (with unit-norm columns), since its columns are the eigenvectors of $C$.

Computing principal components from data
Minimizing $E_d$ with respect to $u_i$, we get $C u_i = \lambda_i u_i$, i.e., the basis vectors $u_i$ are the eigenvectors of the covariance matrix $C$.
Consequently, the minimum error is
$E_d = \frac{n}{2} \sum_{i=d+1}^{m} \lambda_i$
In other words, the minimum error is reached by discarding the eigenvectors corresponding to the $m - d$ smallest eigenvalues, and retaining the eigenvectors corresponding to the $d$ largest eigenvalues $\lambda_i$.

Computing principal components from data
Projecting $x^{(k)}$ onto these eigenvectors gives the components of the transformed vector $z^{(k)}$ in the $d$-dimensional space.
Example: each two-dimensional data point is transformed to a single variable $z_1$ representing the projection of the data point onto the eigenvector $u_1$.
PCA infers the structure (or reduces the redundancy) inherent in high-dimensional data, giving a parsimonious representation.
It is a linear dimensionality reduction algorithm based on a sum-of-squared-error criterion.
Other criteria: covariance measure and population entropy.
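A small NumPy illustration (made-up correlated 2-D data, not from the slides) of the two-dimensional example above:

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical 2-D data with strongly correlated coordinates.
t = rng.standard_normal(300)
X = np.column_stack([t, 0.8 * t + 0.1 * rng.standard_normal(300)])

x_bar = X.mean(axis=0)
C = (X - x_bar).T @ (X - x_bar) / len(X)

evals, U = np.linalg.eigh(C)
u1 = U[:, -1]                      # eigenvector with the largest eigenvalue

# Each 2-D point is reduced to a single variable z1 = u1^T (x - x_bar).
z1 = (X - x_bar) @ u1
print(z1.shape)                    # (300,)
print(evals[-1] / evals.sum())     # fraction of variance captured by u1 (close to 1 here)
```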

Intrinsic dimensionality
A data set in $m$ dimensions has intrinsic dimensionality equal to $m'$ if the data lies entirely within an $m'$-dimensional subspace.
What is the intrinsic dimensionality of the data? The measured intrinsic dimensionality may increase due to noise.
PCA, as a linear approximation, has its limitations.
How do we determine the number of eigenvectors to retain? It is usually determined empirically based on the reconstruction error (i.e., the retained energy), as sketched below.
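A hedged sketch (the choose_num_components helper and the 0.95 threshold are illustrative choices, not from the slides) of the empirical energy criterion:

```python
import numpy as np

def choose_num_components(X, energy=0.95):
    """Smallest d whose leading eigenvalues capture the given fraction of the
    total variance (a common empirical rule; the threshold is arbitrary)."""
    Xc = X - X.mean(axis=0)
    evals = np.linalg.eigvalsh(Xc.T @ Xc / len(X))[::-1]   # descending eigenvalues
    cumulative = np.cumsum(evals) / evals.sum()
    return int(np.searchsorted(cumulative, energy) + 1)

# Example: noisy data that is essentially 2-dimensional, embedded in R^6.
rng = np.random.default_rng(5)
Z = rng.standard_normal((400, 2))
A = rng.standard_normal((2, 6))
X = Z @ A + 0.01 * rng.standard_normal((400, 6))
print(choose_num_components(X))    # typically 2 for this construction
```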