Principal components analysis COMS 4771


1. Representation learning

Useful representations of data

Representation learning:
Given: raw feature vectors $x_1, x_2, \dots, x_n \in \mathbb{R}^d$.
Goal: learn a useful feature transformation $\phi \colon \mathbb{R}^d \to \mathbb{R}^k$. (Often $k \ll d$, i.e., dimensionality reduction, but not always.)
Can then use $\phi$ as a feature map for supervised learning.

Some previously encountered examples:
- Feature maps corresponding to positive definite kernels (+ approximations). (Usually data-oblivious: the feature map doesn't depend on the data.)
- Centering $x \mapsto x - \mu$. (Effect: resulting features have mean 0.)
- Standardization $x \mapsto \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_d)^{-1}(x - \mu)$. (Effect: resulting features have mean 0 and unit variance.)

What other properties of a feature representation may be desirable?
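
As a concrete illustration, here is a minimal NumPy sketch of the centering and standardization maps above (the data matrix X below is synthetic and only stands in for real raw feature vectors):

import numpy as np

# X: n x d array of raw feature vectors, one row per example (synthetic stand-in).
X = np.random.default_rng(0).normal(size=(100, 5)) * np.array([1.0, 2.0, 0.5, 3.0, 1.5])

mu = X.mean(axis=0)        # per-coordinate means
sigma = X.std(axis=0)      # per-coordinate standard deviations

X_centered = X - mu                  # centering: resulting features have mean 0
X_standardized = (X - mu) / sigma    # standardization: mean 0 and unit variance

# The same map phi(x) = diag(sigma)^{-1} (x - mu) can then be applied to new points.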

2. Principal components analysis

Dimensionality reduction via projections

Input: $x_1, x_2, \dots, x_n \in \mathbb{R}^d$, target dimensionality $k \in \mathbb{N}$.
Output: a $k$-dimensional subspace, represented by an orthonormal basis $q_1, q_2, \dots, q_k \in \mathbb{R}^d$.

(Orthogonal) projection: the projection of $x \in \mathbb{R}^d$ onto $\mathrm{span}(q_1, q_2, \dots, q_k)$ is
$\Pi x = \sum_{i=1}^k q_i q_i^T x = \sum_{i=1}^k \langle q_i, x \rangle \, q_i \in \mathbb{R}^d$.

Can also represent the projection of $x$ in terms of its coefficients w.r.t. the orthonormal basis $q_1, q_2, \dots, q_k$:
$\phi(x) := (\langle q_1, x \rangle, \langle q_2, x \rangle, \dots, \langle q_k, x \rangle) \in \mathbb{R}^k$.
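
A small NumPy sketch of the projection $\Pi$ and the coefficient map $\phi$ (the orthonormal basis Q below is generated randomly just for illustration):

import numpy as np

rng = np.random.default_rng(0)
d, k = 10, 3

# Orthonormal basis q_1, ..., q_k as the columns of Q (via QR of a random matrix).
Q, _ = np.linalg.qr(rng.normal(size=(d, k)))   # Q is d x k with Q.T @ Q = I

x = rng.normal(size=d)
phi_x = Q.T @ x        # coefficients (<q_1, x>, ..., <q_k, x>) in R^k
proj_x = Q @ phi_x     # projection Pi x = Q Q^T x, back in R^d

# The residual x - Pi x is orthogonal to the subspace spanned by the q_i.
assert np.allclose(Q.T @ (x - proj_x), 0.0)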

Projection of minimum residual squared error

Objective: find a $k$-dimensional projector $\Pi \colon \mathbb{R}^d \to \mathbb{R}^d$ such that the average residual squared error
$\frac{1}{n} \sum_{i=1}^n \| x_i - \Pi x_i \|_2^2$
is as small as possible.

$k = 1$ case ($\Pi = qq^T$): find a unit vector $q \in \mathbb{R}^d$ to minimize
$\frac{1}{n} \sum_{i=1}^n \| x_i - qq^T x_i \|_2^2 = \frac{1}{n} \sum_{i=1}^n \| x_i \|_2^2 - q^T \Big( \frac{1}{n} \sum_{i=1}^n x_i x_i^T \Big) q = \frac{1}{n} \sum_{i=1}^n \| x_i \|_2^2 - q^T \Big( \frac{1}{n} A^T A \Big) q$,
where $x_i^T$ is the $i$-th row of $A \in \mathbb{R}^{n \times d}$. (The first equality uses $\| x_i - qq^T x_i \|_2^2 = \| x_i \|_2^2 - \langle q, x_i \rangle^2$, which holds because $qq^T$ is an orthogonal projection.) Since the first term does not depend on $q$,
$\arg\min_{q \in \mathbb{R}^d \colon \| q \|_2 = 1} \frac{1}{n} \sum_{i=1}^n \| x_i - qq^T x_i \|_2^2 = \arg\max_{q \in \mathbb{R}^d \colon \| q \|_2 = 1} q^T \Big( \frac{1}{n} A^T A \Big) q$.

Aside: Eigendecompositions

Every symmetric matrix $M \in \mathbb{R}^{d \times d}$ is guaranteed to have an eigendecomposition with real eigenvalues:
$M = V \Lambda V^T = \sum_{i=1}^d \lambda_i v_i v_i^T$,
with real eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_d$ ($\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$) and corresponding orthonormal eigenvectors $v_1, v_2, \dots, v_d$ ($V = [v_1 \; v_2 \; \cdots \; v_d]$).

Fixed-point characterization of eigenvectors: $M v_i = \lambda_i v_i$.

Variational characterization of eigenvectors:
$\max_{q \in \mathbb{R}^d} q^T M q$ s.t. $\| q \|_2 = 1$.
Maximum value: $\lambda_1$ (top eigenvalue). Maximizer: $v_1$ (top eigenvector).

For $i > 1$:
$\max_{q \in \mathbb{R}^d} q^T M q$ s.t. $\| q \|_2 = 1$ and $\langle q, v_j \rangle = 0$ for all $j < i$.
Maximum value: $\lambda_i$ ($i$-th largest eigenvalue). Maximizer: $v_i$ ($i$-th eigenvector).
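
These characterizations can be checked numerically; a sketch using NumPy's symmetric eigensolver (the random symmetric M is only for illustration):

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=(5, 5))
M = (B + B.T) / 2                    # a symmetric matrix

# eigh returns real eigenvalues in ascending order with orthonormal eigenvectors.
evals, evecs = np.linalg.eigh(M)
lam1, v1 = evals[-1], evecs[:, -1]   # top eigenvalue / eigenvector

# Fixed-point characterization: M v1 = lam1 v1.
assert np.allclose(M @ v1, lam1 * v1)

# Variational characterization: q^T M q <= lam1 for any unit vector q,
# with equality attained at q = v1.
q = rng.normal(size=5)
q /= np.linalg.norm(q)
assert q @ M @ q <= lam1 + 1e-12
assert np.isclose(v1 @ M @ v1, lam1)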

Principal components analysis ($k = 1$)

$k = 1$ case ($\Pi = qq^T$):
$\arg\min_{q \in \mathbb{R}^d \colon \| q \|_2 = 1} \frac{1}{n} \sum_{i=1}^n \| x_i - qq^T x_i \|_2^2 = \arg\max_{q \in \mathbb{R}^d \colon \| q \|_2 = 1} q^T \Big( \frac{1}{n} A^T A \Big) q$.
Solution: the eigenvector of $A^T A$ corresponding to the largest eigenvalue (i.e., the top eigenvector $v_1$).
$q^T \Big( \frac{1}{n} A^T A \Big) q = \frac{1}{n} \sum_{i=1}^n \langle q, x_i \rangle^2$
(the variance in direction $q$, assuming $\frac{1}{n} \sum_{i=1}^n x_i = 0$).
So the top eigenvector is the direction of maximum variance.

Principal components analysis (general $k$)

General $k$ case ($\Pi = QQ^T$):
$\arg\min_{Q \in \mathbb{R}^{d \times k} \colon Q^T Q = I} \frac{1}{n} \sum_{i=1}^n \| x_i - QQ^T x_i \|_2^2 = \arg\max_{Q \in \mathbb{R}^{d \times k} \colon Q^T Q = I} \sum_{i=1}^k q_i^T \Big( \frac{1}{n} A^T A \Big) q_i$.
Solution: the $k$ eigenvectors of $A^T A$ corresponding to the $k$ largest eigenvalues.
$\sum_{i=1}^k q_i^T \Big( \frac{1}{n} A^T A \Big) q_i = \sum_{i=1}^k \frac{1}{n} \sum_{j=1}^n \langle q_i, x_j \rangle^2$
(the sum of variances in the directions $q_i$, assuming $\frac{1}{n} \sum_{i=1}^n x_i = 0$).
So the top $k$ eigenvectors span the $k$-dimensional subspace of maximum variance.

Principal components analysis (PCA)

Data matrix $A \in \mathbb{R}^{n \times d}$.

Rank-$k$ PCA ($k$-dimensional linear subspace):
Get the top $k$ eigenvectors $V_k := [v_1 \; v_2 \; \cdots \; v_k]$ of
$\frac{1}{n} A^T A = \frac{1}{n} \sum_{i=1}^n x_i x_i^T$.
Feature map: $\phi(x) := (\langle v_1, x \rangle, \langle v_2, x \rangle, \dots, \langle v_k, x \rangle) \in \mathbb{R}^k$.
Decorrelating property: $\frac{1}{n} \sum_{i=1}^n \phi(x_i) \phi(x_i)^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)$.
Approximate reconstruction: $x \approx V_k \phi(x)$.

Rank-$k$ PCA with centering ($k$-dimensional affine subspace):
Get the top $k$ eigenvectors $V_k := [v_1 \; v_2 \; \cdots \; v_k]$ of
$\frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^T$, where $\mu = \frac{1}{n} \sum_{i=1}^n x_i$.
Feature map: $\phi(x) := (\langle v_1, x - \mu \rangle, \langle v_2, x - \mu \rangle, \dots, \langle v_k, x - \mu \rangle) \in \mathbb{R}^k$.
Decorrelating property: $\frac{1}{n} \sum_{i=1}^n \phi(x_i) = 0$ and $\frac{1}{n} \sum_{i=1}^n \phi(x_i) \phi(x_i)^T = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)$.
Approximate reconstruction: $x \approx \mu + V_k \phi(x)$.
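
Putting the recipe together, a minimal NumPy implementation of rank-$k$ PCA with centering (the function and variable names here are my own, not from the lecture):

import numpy as np

def pca(X, k):
    """Rank-k PCA with centering: returns the mean, the top-k eigenvectors, and their eigenvalues."""
    n, d = X.shape
    mu = X.mean(axis=0)
    C = (X - mu).T @ (X - mu) / n            # (1/n) sum_i (x_i - mu)(x_i - mu)^T
    evals, evecs = np.linalg.eigh(C)         # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:k]        # indices of the k largest eigenvalues
    return mu, evecs[:, idx], evals[idx]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [1.0, 0.5]])   # correlated toy data

mu, V_k, lam = pca(X, k=1)
Phi = (X - mu) @ V_k                # feature map applied to every row of X
X_hat = mu + Phi @ V_k.T            # approximate reconstruction of each x_i

# Decorrelating property: the features are centered, with (co)variances diag(lam).
assert np.allclose(Phi.mean(axis=0), 0.0)
assert np.allclose(Phi.T @ Phi / len(X), np.diag(lam))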

Example: PCA on OCR digits data

Data $\{x_i\}_{i=1}^n$ from $\mathbb{R}^{784}$.

Fraction of residual variance left by the rank-$k$ PCA projection:
$1 - \frac{\sum_{j=1}^k \text{variance in direction } v_j}{\text{total variance}}$.
Fraction of residual variance left by the best $k$ coordinate projections:
$1 - \frac{\sum_{j=1}^k \text{variance in direction } e_j}{\text{total variance}}$.

[Plot: fraction of residual variance vs. dimension of projection $k$ (0 to 800), for coordinate projections and for PCA projections; the PCA curve lies below the coordinate-projection curve.]
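
The two curves can be computed as follows; a sketch where a synthetic low-rank matrix stands in for the OCR digits data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50)) @ rng.normal(size=(50, 784))   # stand-in for the digits data

Xc = X - X.mean(axis=0)
C = Xc.T @ Xc / len(X)                            # sample covariance
evals = np.sort(np.linalg.eigvalsh(C))[::-1]      # variances in eigenvector directions
coord_vars = np.sort(np.diag(C))[::-1]            # variances in coordinate directions
total = evals.sum()

k = 20
pca_residual = 1 - evals[:k].sum() / total         # fraction left by rank-k PCA
coord_residual = 1 - coord_vars[:k].sum() / total  # fraction left by the best k coordinates
print(pca_residual, coord_residual)                # PCA never leaves more residual variance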

Example: compressing digits images

$16 \times 16$ pixel images of handwritten 3s (as vectors in $\mathbb{R}^{256}$).

Mean $\mu$ and eigenvectors $v_1, v_2, v_3, v_4$, with eigenvalues $\lambda_1 = 3.4 \times 10^5$, $\lambda_2 = 2.8 \times 10^5$, $\lambda_3 = 2.4 \times 10^5$, $\lambda_4 = 1.6 \times 10^5$.

[Figure: the mean image and the first four eigenvector images; reconstructions of an example image $x$ for $k = 1$, $k = 10$, $k = 50$, $k = 200$.]

Only have to store $k$ numbers per image, along with the mean $\mu$ and the $k$ eigenvectors ($256(k + 1)$ numbers).

Example: eigenfaces

$92 \times 112$ pixel images of faces (as vectors in $\mathbb{R}^{10304}$), a subset of the 400 images in the Olivetti Face Database.

[Figure: 100 example training images, and the top $k = 48$ eigenvectors.]

Other examples

$x \in \mathbb{R}^d$: movement of stock prices for $d$ different stocks in one day.
Principal component: the combination of stocks that accounts for the most variation in stock price movement.

$x \in \{1, 2, \dots, 5\}^d$: levels at which various terms describe an individual (e.g., "jolly", "impulsive", "outgoing", "conceited", "meddlesome").
Principal components: major personality axes in a population (e.g., "extroversion", "agreeableness", "conscientiousness").

...

3. Computation

Power method

Problem: Given a matrix $A \in \mathbb{R}^{n \times d}$, compute the top eigenvector of $A^T A$.

Initialize with a random $\hat{v} \in \mathbb{R}^d$. Repeat:
1. $\hat{v} := A^T A \hat{v}$.
2. $\hat{v} := \hat{v} / \| \hat{v} \|_2$.

Theorem: For any $\varepsilon \in (0, 1)$, with high probability (over the choice of the initial $\hat{v}$),
$\hat{v}^T A^T A \hat{v} \geq (1 - \varepsilon) \cdot (\text{top eigenvalue of } A^T A)$
after $O\big(\frac{1}{\varepsilon} \log \frac{d}{\varepsilon}\big)$ iterations.

A similar algorithm can be used to get the top $k$ eigenvectors.
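
A direct NumPy sketch of the power method as stated (the fixed iteration count and the comparison at the end are just for illustration):

import numpy as np

def power_method(A, n_iter=200, seed=0):
    """Approximate the top eigenvector of A^T A by repeated multiplication and renormalization."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=A.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        v = A.T @ (A @ v)          # step 1: v := A^T A v (without forming A^T A explicitly)
        v /= np.linalg.norm(v)     # step 2: renormalize
    return v

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 30))
v_hat = power_method(A)

# Compare against the top eigenvector from a dense eigendecomposition.
evals, evecs = np.linalg.eigh(A.T @ A)
print(abs(v_hat @ evecs[:, -1]))   # close to 1: matches the top eigenvector up to sign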

4. Singular value decomposition

Singular value decomposition

Every matrix $A \in \mathbb{R}^{n \times d}$ has a singular value decomposition (SVD)
$A = U S V^T = \sum_{i=1}^r s_i u_i v_i^T$,
where $r = \mathrm{rank}(A)$ ($r \leq \min\{n, d\}$), $U \in \mathbb{R}^{n \times r}$, $S \in \mathbb{R}^{r \times r}$, $V \in \mathbb{R}^{d \times r}$, and:
$U^T U = I$ (i.e., $U = [u_1 \; u_2 \; \cdots \; u_r]$ has orthonormal columns): left singular vectors;
$S = \mathrm{diag}(s_1, s_2, \dots, s_r)$ with $s_1 \geq s_2 \geq \dots \geq s_r > 0$: singular values;
$V^T V = I$ (i.e., $V = [v_1 \; v_2 \; \cdots \; v_r]$ has orthonormal columns): right singular vectors.
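
In NumPy the (thin) SVD is available directly; a quick sketch checking the defining properties on a small random matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)    # U: 6x4, s: length 4, Vt: 4x4

assert np.allclose(A, U @ np.diag(s) @ Vt)          # A = U S V^T
assert np.allclose(U.T @ U, np.eye(4))              # orthonormal left singular vectors
assert np.allclose(Vt @ Vt.T, np.eye(4))            # orthonormal right singular vectors
assert np.all(s[:-1] >= s[1:]) and np.all(s > 0)    # s_1 >= s_2 >= ... > 0 (full rank here)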

SVD vs PCA

If the SVD of $A$ is $U S V^T = \sum_{i=1}^r s_i u_i v_i^T$, then:
the non-zero eigenvalues of $A^T A$ are $s_1^2, s_2^2, \dots, s_r^2$ (the squares of the singular values of $A$);
the corresponding eigenvectors are $v_1, v_2, \dots, v_r \in \mathbb{R}^d$ (the right singular vectors of $A$).

By symmetry, also have:
the non-zero eigenvalues of $A A^T$ are $s_1^2, s_2^2, \dots, s_r^2$ (the squares of the singular values of $A$);
the corresponding eigenvectors are $u_1, u_2, \dots, u_r \in \mathbb{R}^n$ (the left singular vectors of $A$).
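
This correspondence is easy to verify numerically; a brief NumPy check on a random matrix:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 8))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
evals, evecs = np.linalg.eigh(A.T @ A)              # eigenvalues in ascending order

# Eigenvalues of A^T A are the squared singular values of A.
assert np.allclose(np.sort(s**2), evals)

# The top eigenvector of A^T A matches the first right singular vector (up to sign).
top = evecs[:, -1]
assert abs(abs(top @ Vt[0]) - 1.0) < 1e-6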

Low-rank SVD

For any $k \leq \mathrm{rank}(A)$, the rank-$k$ SVD approximation is
$\hat{U}_k \hat{S}_k \hat{V}_k^T = \sum_{i=1}^k s_i u_i v_i^T$,
with $\hat{U}_k \in \mathbb{R}^{n \times k}$, $\hat{S}_k \in \mathbb{R}^{k \times k}$, $\hat{V}_k^T \in \mathbb{R}^{k \times d}$. (Just retain the top $k$ left/right singular vectors and singular values from the SVD.)

Best rank-$k$ approximation:
$\hat{A} := \hat{U}_k \hat{S}_k \hat{V}_k^T = \arg\min_{M \in \mathbb{R}^{n \times d} \colon \mathrm{rank}(M) \leq k} \sum_{i=1}^n \sum_{j=1}^d (A_{i,j} - M_{i,j})^2$.
The minimum value is simply given by
$\sum_{i=1}^n \sum_{j=1}^d (A_{i,j} - \hat{A}_{i,j})^2 = \sum_{t > k} s_t^2$.
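
A short NumPy sketch of the rank-$k$ truncation and of the stated error identity:

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(30, 20))
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]          # rank-k SVD approximation

# Squared Frobenius error equals the sum of the discarded squared singular values.
err = np.sum((A - A_hat) ** 2)
assert np.isclose(err, np.sum(s[k:] ** 2))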

Example: latent semantic analysis

Represent a corpus of documents by counts of the words they contain:

              aardvark   abacus   abalone   ...
 document 1       3         0         0
 document 2       7         0         4
 document 3       2         4         0
 ...

In $A \in \mathbb{R}^{n \times d}$: one row per document, one column per vocabulary word, and $A_{i,j}$ is the number of times word $j$ appears in document $i$.

Example: latent semantic analysis

Statistical model for the document-word count matrix.

Parameters: $\theta = (\beta_1, \beta_2, \dots, \beta_k, \pi_1, \pi_2, \dots, \pi_n, l_1, l_2, \dots, l_n)$.
$k \leq \min\{n, d\}$ topics, each represented by a distribution over vocabulary words: $\beta_1, \beta_2, \dots, \beta_k \in \mathbb{R}^d_+$. Each $\beta_t = (\beta_{t,1}, \beta_{t,2}, \dots, \beta_{t,d})$ is a probability vector, so $\sum_{j=1}^d \beta_{t,j} = 1$.
Each document $i$ is associated with a probability distribution $\pi_i = (\pi_{i,1}, \pi_{i,2}, \dots, \pi_{i,k})$ over topics, so $\sum_{t=1}^k \pi_{i,t} = 1$.

The model posits that document $i$'s count vector (the $i$-th row of $A$) follows a multinomial distribution with $l_i$ trials and word probabilities $\sum_{t=1}^k \pi_{i,t} \beta_t$:
$[A_{i,1} \; A_{i,2} \; \cdots \; A_{i,d}] \sim \mathrm{Multinomial}\big(l_i, \sum_{t=1}^k \pi_{i,t} \beta_t\big)$.
Its expected value is $l_i \sum_{t=1}^k \pi_{i,t} \beta_t^T$.
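
To make the model concrete, here is a hedged sketch that samples a synthetic document-word count matrix from it (all sizes and parameter choices below are invented for illustration):

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 1000, 5                     # documents, vocabulary size, topics

beta = rng.dirichlet(np.ones(d), size=k)   # k topic distributions over words (rows sum to 1)
pi = rng.dirichlet(np.ones(k), size=n)     # per-document topic distributions
lengths = rng.poisson(100, size=n) + 1     # document lengths l_i (at least 1)

# Row i of A ~ Multinomial(l_i, sum_t pi_{i,t} beta_t).
word_probs = pi @ beta                     # n x d matrix of per-document word probabilities
A = np.stack([rng.multinomial(lengths[i], word_probs[i]) for i in range(n)])

print(A.shape, A.sum(axis=1)[:5], lengths[:5])   # row sums equal the document lengths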

Example: latent semantic analysis

Suppose $A \sim P_\theta$. In expectation, $A$ has rank $k$:
$\mathbb{E}(A) = L B^T$, where $L \in \mathbb{R}^{n \times k}$ has rows $l_1 \pi_1^T, l_2 \pi_2^T, \dots, l_n \pi_n^T$ and $B^T \in \mathbb{R}^{k \times d}$ has rows $\beta_1^T, \beta_2^T, \dots, \beta_k^T$.

The observed matrix is $A = \mathbb{E}(A) + (\text{zero-mean noise})$, so $A$ is generally of rank $\min\{n, d\} \gg k$.

Example: latent semantic analysis

Using the SVD: the rank-$k$ SVD $\hat{U}_k \hat{S}_k \hat{V}_k^T$ of $A$ gives an approximation to $L B^T$:
$\hat{A} := \hat{U}_k \hat{S}_k \hat{V}_k^T \approx \mathbb{E}(A)$.
(The SVD helps remove some of the effect of the noise.)

Each of the $n$ documents can be summarized by $k$ numbers:
$\hat{A} \hat{V}_k = \hat{U}_k \hat{S}_k \in \mathbb{R}^{n \times k}$.

The new document feature representation is very useful for information retrieval. (Example: cosine similarities between documents become faster to compute and possibly less noisy.)

Actually estimating the $\pi_i$ and $\beta_t$ takes a bit more work.
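
A minimal sketch of the SVD-based document summaries and a cosine-similarity computation on them (the count matrix A here is a synthetic stand-in, not a real corpus):

import numpy as np

def lsa_embeddings(A, k):
    """Rank-k SVD of the document-word count matrix; row i of the result summarizes document i."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k] * s[:k]                 # \hat U_k \hat S_k, an n x k matrix

def cosine_similarity(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

rng = np.random.default_rng(0)
A = rng.poisson(0.5, size=(100, 2000)).astype(float)   # stand-in document-word counts

E = lsa_embeddings(A, k=10)
print(cosine_similarity(E[0], E[1]))        # similarity between documents 0 and 1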

Recap

PCA: directions of maximum variance in the data; the subspace that minimizes residual squared error.
Computation: the power method.
SVD: a general decomposition for arbitrary matrices.
Low-rank SVD: the best low-rank approximation of a matrix in terms of average squared error.
PCA/SVD: often useful when low-rank structure is expected (e.g., in probabilistic modeling).