Statistics for Applications
Chapter 9: Principal Component Analysis (PCA)

Multivariate statistics and review of linear algebra (1)

Let $X$ be a $d$-dimensional random vector and $X_1, \ldots, X_n$ be $n$ independent copies of $X$.

Write $X_i = (X_i^1, \ldots, X_i^d)^T$, $i = 1, \ldots, n$.

Denote by $\mathbb{X}$ the random $n \times d$ matrix
$$\mathbb{X} = \begin{pmatrix} X_1^T \\ \vdots \\ X_n^T \end{pmatrix}.$$

Multivariate statistics and review of linear algebra (2)

Assume that $E[\|X\|_2^2] < \infty$.

Mean of $X$:
$$E[X] = \big(E[X^1], \ldots, E[X^d]\big)^T.$$

Covariance matrix of $X$: the matrix $\Sigma = (\sigma_{j,k})_{j,k=1,\ldots,d}$, where $\sigma_{j,k} = \mathrm{cov}(X^j, X^k)$.

It is easy to see that
$$\Sigma = E[XX^T] - E[X]E[X]^T = E\big[(X - E[X])(X - E[X])^T\big].$$

Multivariate statistics and review of linear algebra (3)

Empirical mean of $X_1, \ldots, X_n$:
$$\bar{X} = \frac{1}{n}\sum_{i=1}^n X_i = \big(\bar{X}^1, \ldots, \bar{X}^d\big)^T.$$

Empirical covariance of $X_1, \ldots, X_n$: the matrix $S = (s_{j,k})_{j,k=1,\ldots,d}$, where $s_{j,k}$ is the empirical covariance of the $X_i^j$, $X_i^k$, $i = 1, \ldots, n$.

It is easy to see that
$$S = \frac{1}{n}\sum_{i=1}^n X_i X_i^T - \bar{X}\bar{X}^T = \frac{1}{n}\sum_{i=1}^n \big(X_i - \bar{X}\big)\big(X_i - \bar{X}\big)^T.$$
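As a quick illustration (not part of the original slides), here is a minimal NumPy sketch of the empirical mean and the two equivalent forms of the empirical covariance, with the 1/n normalization used above; the toy data and variable names are assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))        # rows are the observations X_1, ..., X_n

xbar = X.mean(axis=0)              # empirical mean, a length-d vector
Xc = X - xbar                      # centered observations X_i - xbar
S = Xc.T @ Xc / n                  # (1/n) sum_i (X_i - xbar)(X_i - xbar)^T

# Equivalent form: (1/n) sum_i X_i X_i^T - xbar xbar^T
S_alt = X.T @ X / n - np.outer(xbar, xbar)
assert np.allclose(S, S_alt)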

Multivariate statistics and review of linear algebra (4)

Note that
$$\bar{X} = \frac{1}{n}\mathbb{X}^T \mathbf{1}, \quad \text{where } \mathbf{1} = (1, \ldots, 1)^T \in \mathbb{R}^n.$$

Note also that
$$S = \frac{1}{n}\mathbb{X}^T\mathbb{X} - \frac{1}{n^2}\mathbb{X}^T \mathbf{1}\mathbf{1}^T \mathbb{X} = \frac{1}{n}\mathbb{X}^T H \mathbb{X}, \quad \text{where } H = I_n - \frac{1}{n}\mathbf{1}\mathbf{1}^T.$$

$H$ is an orthogonal projector: $H^2 = H$, $H^T = H$. (On what subspace?)

If $u \in \mathbb{R}^d$, then $u^T \Sigma u$ is the variance of $u^T X$ and $u^T S u$ is the sample variance of $u^T X_1, \ldots, u^T X_n$.
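A small numerical check of these identities (illustrative only, not from the slides); the data are synthetic.

import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 4
X = rng.normal(size=(n, d))

H = np.eye(n) - np.ones((n, n)) / n    # H = I_n - (1/n) 1 1^T
assert np.allclose(H @ H, H)           # idempotent
assert np.allclose(H, H.T)             # symmetric: an orthogonal projector

S_via_H = X.T @ H @ X / n              # S = (1/n) X^T H X
Xc = X - X.mean(axis=0)
assert np.allclose(S_via_H, Xc.T @ Xc / n)

# u^T S u is the sample variance of u^T X_1, ..., u^T X_n
u = rng.normal(size=d)
proj = X @ u
assert np.isclose(u @ S_via_H @ u, proj.var())   # NumPy's var also uses the 1/n convention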

Multivariate statistics and review of linear algebra (5)

In particular, $u^T S u$ measures how spread out (i.e., diverse) the points are in the direction $u$.

If $u^T S u = 0$, then all the $X_i$'s are in an affine subspace orthogonal to $u$.

If $u^T \Sigma u = 0$, then $X$ is almost surely in an affine subspace orthogonal to $u$.

If $u^T S u$ is large with $\|u\|_2 = 1$, then the direction of $u$ explains well the spread (i.e., diversity) of the sample.

Multivariate statistics and review of linear algebra (6)

In particular, $\Sigma$ and $S$ are symmetric, positive semi-definite.

Any real symmetric matrix $A \in \mathbb{R}^{d \times d}$ has the decomposition $A = PDP^T$, where $P$ is a $d \times d$ orthogonal matrix, i.e., $PP^T = P^T P = I_d$, and $D$ is diagonal.

The diagonal elements of $D$ are the eigenvalues of $A$ and the columns of $P$ are the corresponding eigenvectors of $A$.

$A$ is positive semi-definite iff all its eigenvalues are nonnegative.
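As an illustrative sketch (assumptions: NumPy, a random symmetric PSD matrix), the spectral decomposition $A = PDP^T$ can be computed with numpy.linalg.eigh, which handles symmetric matrices and returns eigenvalues in ascending order.

import numpy as np

rng = np.random.default_rng(2)
B = rng.normal(size=(5, 5))
A = B @ B.T                            # symmetric, positive semi-definite

eigvals, P = np.linalg.eigh(A)         # eigenvalues in ascending order
D = np.diag(eigvals)
assert np.allclose(P @ D @ P.T, A)     # A = P D P^T
assert np.allclose(P.T @ P, np.eye(5)) # columns of P are orthonormal eigenvectors
assert np.all(eigvals >= -1e-8)        # PSD: all eigenvalues nonnegative (up to rounding)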

Principal Component Analysis: Heuristics (1)

The sample $X_1, \ldots, X_n$ makes a cloud of points in $\mathbb{R}^d$.

In practice, $d$ is large. If $d > 3$, it becomes impossible to represent the cloud on a picture.

Question: Is it possible to project the cloud onto a linear subspace of dimension $d' < d$ while keeping as much information as possible?

Answer: PCA does this by keeping as much of the covariance structure as possible, i.e., by retaining orthogonal directions that discriminate the points of the cloud well.

Principal Component Analysis: Heuristics (2)

Idea: Write $S = PDP^T$, where $P = (v_1, \ldots, v_d)$ is an orthogonal matrix, i.e., $\|v_j\|_2 = 1$ and $v_j^T v_k = 0$ for $j \neq k$, and
$$D = \begin{pmatrix} \lambda_1 & & & 0 \\ & \lambda_2 & & \\ & & \ddots & \\ 0 & & & \lambda_d \end{pmatrix}, \quad \text{with } \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0.$$

Note that $D$ is the empirical covariance matrix of the $P^T X_i$'s, $i = 1, \ldots, n$.

In particular, $\lambda_1$ is the empirical variance of the $v_1^T X_i$'s, $\lambda_2$ is the empirical variance of the $v_2^T X_i$'s, etc.
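An illustrative check of this statement on synthetic data (not from the slides; the data are centered first, which does not change the empirical covariance): rotating the cloud by $P^T$ makes its empirical covariance diagonal, equal to $D$.

import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 3
A = np.array([[2.0, 0.0, 0.0],
              [0.5, 1.0, 0.0],
              [0.0, 0.0, 0.3]])
X = rng.normal(size=(n, d)) @ A        # correlated toy data

Xc = X - X.mean(axis=0)
S = Xc.T @ Xc / n
lam, P = np.linalg.eigh(S)             # ascending order
lam, P = lam[::-1], P[:, ::-1]         # reorder so that lambda_1 >= ... >= lambda_d

Z = Xc @ P                             # row i is P^T (X_i - xbar)
assert np.allclose(Z.T @ Z / n, np.diag(lam))   # covariance of the rotated data is D
print(lam)                             # lambda_j = empirical variance along v_j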

Principal Component Analysis: Heuristics (3)

So, each $\lambda_j$ measures the spread of the cloud in the direction $v_j$.

In particular, $v_1$ is the direction of maximal spread. Indeed, $v_1$ maximizes the empirical variance of $a^T X_1, \ldots, a^T X_n$ over all $a \in \mathbb{R}^d$ such that $\|a\|_2 = 1$.

Proof: For any unit vector $a$, show that
$$a^T S a = \big(P^T a\big)^T D \big(P^T a\big) \leq \lambda_1,$$
with equality if $a = v_1$.

Principal Component Analysis: Main principle

Idea of the PCA: Find the collection of orthogonal directions in which the cloud is very spread out.

Theorem:
$$v_1 \in \operatorname*{argmax}_{\|u\|=1} u^T S u, \qquad
v_2 \in \operatorname*{argmax}_{\|u\|=1,\, u \perp v_1} u^T S u, \qquad \ldots, \qquad
v_d \in \operatorname*{argmax}_{\|u\|=1,\, u \perp v_j,\, j=1,\ldots,d-1} u^T S u.$$

Hence, the $k$ orthogonal directions in which the cloud is most spread out correspond exactly to the eigenvectors associated with the $k$ largest eigenvalues of $S$.
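A quick numerical sanity check of the first claim (illustrative only; the matrix S below is a random stand-in for an empirical covariance): no random unit vector beats the top eigenvector.

import numpy as np

rng = np.random.default_rng(4)
d = 5
B = rng.normal(size=(d, d))
S = B @ B.T / d                        # stand-in for an empirical covariance matrix

lam, P = np.linalg.eigh(S)
v1, lam1 = P[:, -1], lam[-1]           # top eigenpair (eigh sorts ascending)

U = rng.normal(size=(10000, d))
U /= np.linalg.norm(U, axis=1, keepdims=True)      # random unit vectors u
quad = np.einsum('ij,jk,ik->i', U, S, U)           # u^T S u for each u
assert quad.max() <= lam1 + 1e-12
assert np.isclose(v1 @ S @ v1, lam1)               # the maximum lambda_1 is attained at v_1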

Principal Component Analysis: Algorithm (1)

1. Input: $X_1, \ldots, X_n$: cloud of $n$ points in dimension $d$.

2. Step 1: Compute the empirical covariance matrix $S$.

3. Step 2: Compute the decomposition $S = PDP^T$, where $D = \mathrm{Diag}(\lambda_1, \ldots, \lambda_d)$, with $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d$, and $P = (v_1, \ldots, v_d)$ is an orthogonal matrix.

4. Step 3: Choose $k < d$ and set $P_k = (v_1, \ldots, v_k) \in \mathbb{R}^{d \times k}$.

5. Output: $Y_1, \ldots, Y_n$, where $Y_i = P_k^T X_i \in \mathbb{R}^k$, $i = 1, \ldots, n$.

Question: How to choose $k$?
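A minimal NumPy sketch of this algorithm (not from the slides; the function name and toy data are assumptions, and the data are centered before projection, which is the common convention).

import numpy as np

def pca(X, k):
    """Rows of X are the points X_1, ..., X_n in R^d; returns the n x k matrix of Y_i's."""
    n = X.shape[0]
    Xc = X - X.mean(axis=0)                # center the cloud
    S = Xc.T @ Xc / n                      # Step 1: empirical covariance matrix
    lam, P = np.linalg.eigh(S)             # Step 2: S = P D P^T, eigenvalues ascending
    order = np.argsort(lam)[::-1]          # sort so that lambda_1 >= ... >= lambda_d
    P_k = P[:, order[:k]]                  # Step 3: P_k = (v_1, ..., v_k)
    return Xc @ P_k                        # Output: Y_i = P_k^T (X_i - xbar)

rng = np.random.default_rng(5)
X = rng.normal(size=(100, 6))
Y = pca(X, k=2)                            # 2-D representation, e.g. for plotting
print(Y.shape)                             # (100, 2)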

Principal Component Analysis: Algorithm (2)

Question: How to choose $k$?

Experimental rule: Take $k$ where there is an inflection point in the sequence $\lambda_1, \ldots, \lambda_d$ (scree plot).

Define a criterion: Take $k$ such that
$$\frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_d} \geq 1 - \alpha,$$
for some $\alpha \in (0, 1)$ that determines the approximation error that the practitioner wants to achieve.

Remark: $\lambda_1 + \cdots + \lambda_k$ is called the variance explained by the PCA and $\lambda_1 + \cdots + \lambda_d = \mathrm{Tr}(S)$ is the total variance.

Data visualization: Take $k = 2$ or $3$.
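An illustrative sketch of the explained-variance criterion on a toy spectrum (the eigenvalues below are made up for the example).

import numpy as np

lam = np.array([5.0, 3.0, 1.5, 0.3, 0.1, 0.05])   # toy eigenvalues, lambda_1 >= ... >= lambda_d
explained = np.cumsum(lam) / lam.sum()            # (lambda_1 + ... + lambda_k) / Tr(S)

alpha = 0.10
k = int(np.argmax(explained >= 1 - alpha)) + 1    # smallest k reaching the 1 - alpha threshold
print(np.round(explained, 3))                     # cumulative proportion of total variance
print(k)                                          # here k = 3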

Example: Expression of 500,000 genes among 1400 Europeans

(Figure: Reprinted by permission from Macmillan Publishers Ltd: Nature. Source: John Novembre, et al. "Genes mirror geography within Europe." Nature 456 (2008): 98-101. © 2008.)

Principal Component Analysis - Beyond practice (1)

PCA is an algorithm that reduces the dimension of a cloud of points and keeps its covariance structure as much as possible.

In practice, this algorithm is used for clouds of points that are not necessarily random.

In statistics, PCA can be used for estimation. If $X_1, \ldots, X_n$ are i.i.d. random vectors in $\mathbb{R}^d$, how to estimate their population covariance matrix $\Sigma$?

If $n \gg d$, then the empirical covariance matrix $S$ is a consistent estimator.

In many applications, $n \ll d$ (e.g., gene expression). Solution: sparse PCA.

Principal Component Analysis - Beyond practice (2)

It may be known beforehand that $\Sigma$ has (almost) low rank.

Then, run PCA on $S$: write $S \approx S'$, where
$$S' = P\,\mathrm{Diag}(\lambda_1, \ldots, \lambda_k, 0, \ldots, 0)\,P^T.$$

$S'$ will be a better estimator of $\Sigma$ under the low-rank assumption.

A theoretical analysis would lead to an optimal choice of the tuning parameter $k$.
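An illustrative sketch of this rank-k truncation (not from the slides; the function name and toy matrix are assumptions): keep the k largest eigenvalues of S and set the rest to zero.

import numpy as np

def truncate_rank(S, k):
    """Return S' = P Diag(lambda_1, ..., lambda_k, 0, ..., 0) P^T."""
    lam, P = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1]          # lambda_1 >= ... >= lambda_d
    lam, P = lam[order], P[:, order]
    lam[k:] = 0.0                          # zero out all but the k largest eigenvalues
    return P @ np.diag(lam) @ P.T

rng = np.random.default_rng(6)
B = rng.normal(size=(8, 8))
S = B @ B.T / 8
S_prime = truncate_rank(S, k=2)
print(np.linalg.matrix_rank(S_prime))      # 2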

MIT OpenCourseWare
https://ocw.mit.edu

18.650 / 18.6501 Statistics for Applications
Fall 2016

For information about citing these materials or our Terms of Use, visit: https://ocw.mit.edu/terms.