Dimension reduction, PCA & eigenanalysis
Statistics 202: Data Mining. Based in part on slides from the textbook and slides of Susan Holmes. October 3, 2012.


Combinations of features

Given a data matrix $X_{n\times p}$ with $p$ fairly large, it can be difficult to visualize structure. It is often useful to look at linear combinations of the features. Each $\beta \in \mathbb{R}^p$ determines a linear rule
$$f_\beta(x) = x^T \beta.$$
Evaluating this on each row $X_i$ of $X$ yields the vector $(f_\beta(X_i))_{1 \le i \le n} = X\beta$.

Dimension reduction

By choosing $\beta$ appropriately, we may find interesting new features. Suppose we take $k$ (much smaller than $p$) such vectors and write them as the columns of a matrix $B_{p\times k}$. The new data matrix $XB$ has fewer columns than $X$ ($k$ instead of $p$). This is dimension reduction.
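As a quick illustration, here is a minimal numpy sketch (randomly generated data stands in for $X$; all names are illustrative): a single $\beta$ gives one derived feature $X\beta$, and a $p\times k$ matrix $B$ gives the reduced $n\times k$ data matrix $XB$.

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, k = 100, 10, 2                 # hypothetical sizes
    X = rng.standard_normal((n, p))      # stand-in for the data matrix

    beta = rng.standard_normal(p)        # one linear rule f_beta(x) = x^T beta
    scores_1d = X @ beta                 # f_beta evaluated on every row: shape (n,)

    B = rng.standard_normal((p, k))      # k linear rules collected as columns
    X_reduced = X @ B                    # reduced data matrix, shape (n, k)
    print(scores_1d.shape, X_reduced.shape)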

Principal Components

This method of dimension reduction seeks features that maximize sample variance. Specifically, for $\beta \in \mathbb{R}^p$, define the sample variance of $V_\beta = X\beta \in \mathbb{R}^n$ as
$$\mathrm{Var}(V_\beta) = \frac{1}{n-1} \sum_{i=1}^n (V_{\beta,i} - \bar V_\beta)^2, \qquad \bar V_\beta = \frac{1}{n}\sum_{i=1}^n V_{\beta,i}.$$
Note that $\mathrm{Var}(V_{c\beta}) = c^2\,\mathrm{Var}(V_\beta)$, so we usually standardize $\|\beta\|^2 = \sum_{j=1}^p \beta_j^2 = 1$.
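A small numerical check of the scaling identity $\mathrm{Var}(V_{c\beta}) = c^2\,\mathrm{Var}(V_\beta)$ (a sketch with simulated data; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(1)
    n, p = 200, 5
    X = rng.standard_normal((n, p))
    beta = rng.standard_normal(p)
    beta = beta / np.linalg.norm(beta)        # standardize ||beta|| = 1

    V = X @ beta
    var_V = V.var(ddof=1)                     # sample variance with 1/(n-1)

    c = 3.0
    V_scaled = X @ (c * beta)
    print(np.isclose(V_scaled.var(ddof=1), c**2 * var_V))   # True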

Principal Components

Define the $n\times n$ centering matrix
$$H = I_{n\times n} - \frac{1}{n}\,\mathbf{1}\mathbf{1}^T.$$
This matrix removes means: $(Hv)_i = v_i - \bar v$. It is also a projection matrix: $H^T = H$ and $H^2 = H$.
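A minimal sketch verifying these properties numerically (simulated $v$; names are illustrative):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    H = np.eye(n) - np.ones((n, n)) / n      # H = I - (1/n) 1 1^T

    v = rng.standard_normal(n)
    print(np.allclose(H @ v, v - v.mean()))  # removes the mean
    print(np.allclose(H, H.T))               # symmetric
    print(np.allclose(H @ H, H))             # idempotent: H^2 = H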

Principal Components with Matrices

With this matrix,
$$\mathrm{Var}(V_\beta) = \frac{1}{n-1}\,\beta^T X^T H X \beta.$$
So maximizing the sample variance subject to $\|\beta\|^2 = 1$ is the problem
$$(\ast)\qquad \underset{\|\beta\|^2 = 1}{\text{maximize}}\ \beta^T X^T H X \beta.$$
This boils down to an eigenvalue problem...

Eigenanalysis

The matrix $X^T H X$ is symmetric, so it can be written as
$$X^T H X = V D V^T,$$
where $D_{k\times k} = \mathrm{diag}(d_1,\dots,d_k)$ with $k = \mathrm{rank}(X^T H X)$, and $V_{p\times k}$ has orthonormal columns, i.e. $V^T V = I_{k\times k}$. We always have $d_i \ge 0$ (since $X^T H X = (HX)^T(HX)$ is positive semidefinite), and we order the eigenvalues so that $d_1 \ge d_2 \ge \dots \ge d_k$.
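A sketch of this eigenanalysis in numpy (simulated $X$; note that np.linalg.eigh returns eigenvalues in increasing order, so they are flipped to get $d_1 \ge d_2 \ge \dots$):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 50, 4
    X = rng.standard_normal((n, p))
    H = np.eye(n) - np.ones((n, n)) / n

    M = X.T @ H @ X                    # symmetric p x p matrix X^T H X
    d, V = np.linalg.eigh(M)           # eigenvalues in ascending order
    d, V = d[::-1], V[:, ::-1]         # reorder so d_1 >= d_2 >= ...

    print(np.allclose(V @ np.diag(d) @ V.T, M))   # M = V D V^T
    print(np.all(d >= -1e-10))                    # eigenvalues are nonnegative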

Eigenanalysis & PCA

Suppose now that $\beta = \sum_{j=1}^k a_j v_j$, with $v_j$ the columns of $V$. Then
$$\|\beta\|^2 = \sum_{j=1}^k a_j^2, \qquad \beta^T X^T H X \beta = \sum_{j=1}^k a_j^2 d_j.$$
Since $d_1$ is the largest eigenvalue, $\sum_j a_j^2 d_j \le d_1 \sum_j a_j^2$, so choosing $a_1 = 1$ and $a_j = 0$ for $j \ge 2$ maximizes this quantity subject to $\|\beta\|^2 = 1$.

Eigenanalysis & PCA

Therefore $\beta_1 = v_1$, the first column of $V$, solves
$$(\ast)\qquad \underset{\|\beta\|^2 = 1}{\text{maximize}}\ \beta^T X^T H X \beta.$$
This yields the scores $H X \beta_1 \in \mathbb{R}^n$. These are the 1st principal component scores.
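Continuing the sketch above, the first principal component scores are $HXv_1$, and their sample variance equals $d_1/(n-1)$:

    import numpy as np

    rng = np.random.default_rng(4)
    n, p = 50, 4
    X = rng.standard_normal((n, p))
    H = np.eye(n) - np.ones((n, n)) / n

    d, V = np.linalg.eigh(X.T @ H @ X)
    d, V = d[::-1], V[:, ::-1]

    scores_1 = H @ X @ V[:, 0]               # 1st principal component scores
    print(np.isclose(scores_1.var(ddof=1), d[0] / (n - 1)))   # True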

Higher order components

Having found the direction with maximal sample variance, we might look for the second most variable direction by solving
$$\underset{\|\beta\|^2 = 1,\ \beta^T v_1 = 0}{\text{maximize}}\ \beta^T X^T H X \beta.$$
Note that we restricted our search so we do not just recover $v_1$ again. It is not hard to see that if $\beta = \sum_{j=1}^k a_j v_j$, this is solved by taking $a_2 = 1$ and $a_j = 0$ for $j \ne 2$, i.e. $\beta_2 = v_2$.
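A sketch checking that the second loading is orthogonal to the first and that the two score vectors are uncorrelated (simulated data again):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 80, 5
    X = rng.standard_normal((n, p))
    H = np.eye(n) - np.ones((n, n)) / n

    d, V = np.linalg.eigh(X.T @ H @ X)
    d, V = d[::-1], V[:, ::-1]

    v1, v2 = V[:, 0], V[:, 1]
    s1, s2 = H @ X @ v1, H @ X @ v2

    print(np.isclose(v1 @ v2, 0.0))          # loadings are orthogonal
    print(np.isclose(s1 @ s2, 0.0))          # score vectors are uncorrelated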

Higher order components

In matrix terms, all the principal component scores are $(HXV)_{n\times k}$ and the loadings are the columns of $V$. This information can be summarized in a biplot. The loadings describe how each feature contributes to each principal component score.

Olympic data [figure]

Olympic data: screeplot [figure]

The importance of scale

The PCA scores are not invariant to scaling of the features: for a diagonal matrix $Q_{p\times p}$, the PCA scores of $XQ$ are not the same as those of $X$. It is common to convert all variables to the same scale before applying PCA. Define the scalings to be the sample standard deviations of the features: in matrix form, let
$$S^2 = \frac{1}{n-1}\,\mathrm{diag}\!\left(X^T H X\right)$$
and define $\tilde X = H X S^{-1}$. The normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$.
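A sketch of this standardization in numpy; note that $\tilde X^T \tilde X$ is then $(n-1)$ times the sample correlation matrix (simulated data with deliberately unequal scales):

    import numpy as np

    rng = np.random.default_rng(6)
    n, p = 60, 4
    X = rng.standard_normal((n, p)) * np.array([1.0, 10.0, 0.1, 5.0])  # unequal scales

    H = np.eye(n) - np.ones((n, n)) / n
    S = np.sqrt(np.diag(X.T @ H @ X) / (n - 1))    # sample standard deviations
    X_tilde = H @ X @ np.diag(1.0 / S)             # centered and scaled data

    d, V = np.linalg.eigh(X_tilde.T @ X_tilde)     # normalized PCA loadings
    d, V = d[::-1], V[:, ::-1]
    print(np.allclose(X_tilde.T @ X_tilde, (n - 1) * np.corrcoef(X, rowvar=False)))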

Olympic data [figure]

Olympic data: screeplot [figure]

PCA and the SVD

The singular value decomposition of a matrix tells us we can write
$$\tilde X = U \Delta V^T$$
with $\Delta_{k\times k} = \mathrm{diag}(\delta_1,\dots,\delta_k)$, $\delta_j \ge 0$, $k = \mathrm{rank}(\tilde X)$, and $U^T U = V^T V = I_{k\times k}$. Recall that the scores were
$$\tilde X V = (U \Delta V^T) V = U \Delta.$$
Also, $\tilde X^T \tilde X = V \Delta^2 V^T$, so $D = \Delta^2$.
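A numerical sketch of this correspondence, using numpy's reduced SVD (simulated data standing in for $\tilde X$):

    import numpy as np

    rng = np.random.default_rng(7)
    n, p = 60, 4
    X = rng.standard_normal((n, p))
    H = np.eye(n) - np.ones((n, n)) / n
    S = np.sqrt(np.diag(X.T @ H @ X) / (n - 1))
    X_tilde = H @ X @ np.diag(1.0 / S)

    U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)
    d = np.linalg.eigh(X_tilde.T @ X_tilde)[0][::-1]

    print(np.allclose(d, delta**2))                # D = Delta^2
    print(np.allclose(X_tilde @ Vt.T, U * delta))  # scores: X~ V = U Delta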

PCA and the SVD

Recall $\tilde X = H X S^{-1}$. We saw that the normalized PCA loadings are given by an eigenanalysis of $\tilde X^T \tilde X$. It turns out that, via the SVD, the normalized PCA scores are given by an eigenanalysis of $\tilde X \tilde X^T$, because
$$(\tilde X V)_{n\times k} = U \Delta,$$
where the columns of $U$ are eigenvectors of $\tilde X \tilde X^T = U \Delta^2 U^T$.
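A sketch checking that the leading eigenvectors of $\tilde X \tilde X^T$ span the same space as the left singular vectors $U$ (a centered random matrix stands in for $\tilde X$):

    import numpy as np

    rng = np.random.default_rng(8)
    n, p = 30, 4
    X_tilde = rng.standard_normal((n, p))
    X_tilde = X_tilde - X_tilde.mean(axis=0)       # center, as H X S^{-1} would be

    U, delta, Vt = np.linalg.svd(X_tilde, full_matrices=False)

    w, E = np.linalg.eigh(X_tilde @ X_tilde.T)     # n x n eigenanalysis
    w, E = w[::-1], E[:, ::-1]                     # descending order

    k = p                                          # rank of X_tilde here
    print(np.allclose(w[:k], delta**2))            # same nonzero eigenvalues
    print(np.allclose(E[:, :k] @ E[:, :k].T, U @ U.T))  # top eigenvectors span the columns of U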

Another characterization of the SVD

Given a data matrix $X$ (or its scaled, centered version $\tilde X$), we might try solving
$$\underset{Z:\ \mathrm{rank}(Z)=k}{\text{minimize}}\ \|X - Z\|_F^2,$$
where $\|\cdot\|_F$ stands for the Frobenius norm on matrices,
$$\|A\|_F^2 = \sum_{i=1}^n \sum_{j=1}^p A_{ij}^2.$$

Another characterization of the SVD

It can be proven (the Eckart–Young theorem) that the solution is
$$Z^* = S_{n\times k}\,(V^T)_{k\times p},$$
where $S_{n\times k}$ is the matrix of the first $k$ PCA scores and the columns of $V$ are the first $k$ PCA loadings. This approximation is related to the screeplot: the height of each bar describes the additional drop in squared Frobenius error as the rank of the approximation increases.
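A sketch illustrating the connection to the screeplot: the squared Frobenius error of the rank-$k$ truncated SVD is the sum of the remaining squared singular values, so each increase in rank drops the error by the next $\delta_j^2$ (simulated data, not the Olympic dataset):

    import numpy as np

    rng = np.random.default_rng(9)
    n, p = 40, 6
    X = rng.standard_normal((n, p))

    U, delta, Vt = np.linalg.svd(X, full_matrices=False)

    for k in range(1, p + 1):
        Z_k = U[:, :k] @ np.diag(delta[:k]) @ Vt[:k, :]    # rank-k approximation
        err = np.sum((X - Z_k) ** 2)                       # squared Frobenius error
        print(k, np.isclose(err, np.sum(delta[k:] ** 2)))  # equals the leftover delta_j^2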

Olympic data: screeplot [figure]

Other types of dimension reduction

Instead of maximizing sample variance, we might try maximizing some other quantity. Independent Component Analysis (ICA) tries to maximize the non-Gaussianity of $V_\beta$; in practice it uses skewness, kurtosis or other moments to quantify non-Gaussianity. Both PCA and ICA are unsupervised approaches. Often, they are combined with supervised approaches into an algorithm like the following (see the sketch after this list):

Feature creation: build some set of features using PCA.
Validate: use the derived features to see whether they are helpful in the supervised problem.
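One hedged way to implement this two-step recipe, assuming scikit-learn is available (PCA, LinearRegression, Pipeline and cross_val_score are standard scikit-learn components; the data here are simulated, not the Olympic dataset):

    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline

    rng = np.random.default_rng(10)
    n, p, k = 200, 20, 3
    X = rng.standard_normal((n, p))
    y = X[:, :k] @ rng.standard_normal(k) + 0.1 * rng.standard_normal(n)

    # Feature creation: derive k PCA features.  Validate: cross-validate a
    # regression that uses only those derived features.
    model = Pipeline([("pca", PCA(n_components=k)),
                      ("reg", LinearRegression())])
    print(cross_val_score(model, X, y, cv=5).mean())   # held-out R^2 using PCA features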

Olympic data: 1st component vs. total score [figure]
