Algebra of Principal Component Analysis


Algebra of Principal Component Analysis

Data (5 objects, 2 variables):

    Y =   2  1
          3  4
          5  0
          7  6
          9  2

Centre each column on its mean:

    Yc = [y - ȳ] =  -3.2  -1.6
                    -2.2   1.4
                    -0.2  -2.6
                     1.8   3.4
                     3.8  -0.6

Covariance matrix (2 variables):

    S = 1/(n-1) Yc'Yc =  8.2  1.6
                         1.6  5.8

Equation for the eigenvalues and eigenvectors of S:

    (S - λk I) uk = 0

Eigenvalues: λ1 = 9, λ2 = 5

Matrix of eigenvalues:   Λ =  9  0
                              0  5

Matrix of eigenvectors:  U =  0.8944  -0.4472
                              0.4472   0.8944

Positions of the 5 objects in ordination space:

    F = [y - ȳ] U = Yc U =  -3.578   0.000
                            -1.342   2.236
                            -1.342  -2.236
                             3.130   2.236
                             3.130  -2.236

[Figure: the five objects plotted along Var 1 and Var 2.]
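These steps can be reproduced with a short base-R sketch (not part of the original slide); note that eigen() may return eigenvectors with opposite signs, which only flips the corresponding ordination axis.

# Base-R sketch of the PCA algebra above
Y <- matrix(c(2, 3, 5, 7, 9,     # variable 1
              1, 4, 0, 6, 2),    # variable 2
            nrow = 5, ncol = 2)
Yc <- scale(Y, center = TRUE, scale = FALSE)  # centre each column on its mean
S  <- cov(Yc)                                 # covariance matrix: 8.2 1.6 / 1.6 5.8
eig <- eigen(S)
eig$values                                    # eigenvalues: 9 and 5
U <- eig$vectors                              # eigenvectors u1, u2 (columns)
F.mat <- Yc %*% U                             # positions of the objects (matrix F)
F.mat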

Figure 9.2  Numerical example of principal component analysis. (a) Five objects are plotted with respect to descriptors y1 and y2. (b) After centring the data, the objects are plotted with respect to (y1 - ȳ1) and (y2 - ȳ2), represented by dashed axes. (c) The objects are plotted with reference to principal axes I and II, which are centred with respect to the scatter of points. (d) The two systems of axes (b and c) can be superimposed after a rotation of 26°34'.

Fig. 9.3  Numerical example from Fig. 9.2. Distance and correlation biplots are discussed in Subsection 9.1.4. (a) Distance biplot. The eigenvectors are scaled to lengths 1. Inset: descriptors (matrix U). Main graph: descriptors (matrix U; arrows) and objects (matrix F; dots). The interpretation of the object-descriptor relationships is not based on their proximity, but on orthogonal projections (dashed lines) of the objects on the descriptor-axes or their extensions. (b) Correlation biplot. Descriptors (matrix UΛ^(1/2); arrows) with a covariance angle of 76°35'. Objects (matrix G; dots). Projecting the objects orthogonally on a descriptor (dashed lines) reconstructs the values of the objects along that descriptor, to within a multiplicative constant.

Use the following matrices to draw the biplots:

    Distance biplot (scaling 1):    objects = F,              variables = U
    Correlation biplot (scaling 2): objects = G = F Λ^(-1/2), variables = Usc = U Λ^(1/2)

These two projections respect the biplot rule: the product of the two projected matrices reconstructs the centred data Yc.

    Distance biplot:    F U' = Yc
    Correlation biplot: G (U Λ^(1/2))' = Yc
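The biplot rule can be checked numerically by continuing the base-R sketch above (Yc, U, eig, and F.mat are assumed to be in the workspace):

# Scaling 2 matrices and the biplot rule (2-variable example)
lambda <- eig$values
G    <- F.mat %*% diag(lambda^(-0.5))   # objects, correlation biplot (scaling 2)
U.sc <- U %*% diag(lambda^(0.5))        # descriptors, correlation biplot (scaling 2)
range(F.mat %*% t(U)    - Yc)           # ~0: F U' reconstructs Yc
range(G     %*% t(U.sc) - Yc)           # ~0: G (U Lambda^(1/2))' reconstructs Yc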

PCA example, three variables

# Create the data matrix
data.3 <- matrix(c(2,3,5,7,9,10, 1,4,0,6,2,5, 0,1,-1,-1,1,0), 6, 3)
data.3
     [,1] [,2] [,3]
[1,]    2    1    0
[2,]    3    4    1
[3,]    5    0   -1
[4,]    7    6   -1
[5,]    9    2    1
[6,]   10    5    0

# Centre the variables
data.cent <- scale(data.3, center=TRUE, scale=FALSE)
data.cent
     [,1] [,2] [,3]
[1,]   -4   -2    0
[2,]   -3    1    1
[3,]   -1   -3   -1
[4,]    1    3   -1
[5,]    3   -1    1
[6,]    4    2    0

# Compute the covariance matrix
data.cov <- cov(data.cent)
# or, because the data are not standardized: cov(data.3)
data.cov
     [,1] [,2] [,3]
[1,] 10.4  3.2  0.0
[2,]  3.2  5.6  0.0
[3,]  0.0  0.0  0.8

# Compute the eigenvalues and eigenvectors
data.eig <- eigen(data.cov)
data.eig
$values
[1] 12.0  4.0  0.8

$vectors
          [,1]       [,2] [,3]
[1,] 0.8944272  0.4472136    0
[2,] 0.4472136 -0.8944272    0
[3,] 0.0000000  0.0000000    1

# Compute the output matrices for scaling 1
U <- data.eig$vectors
F.mat <- data.cent %*% U

# Compute the output matrices for scaling 2
U.sc  <- U %*% diag(data.eig$values^(0.5))
G.mat <- F.mat %*% diag(data.eig$values^(-0.5))

# Draw the scaling 1 and scaling 2 biplots
par(mfrow=c(1,2))
biplot(F.mat[,c(1,2)], U[,c(1,2)])      # scaling 1, axes 1 and 2
biplot(F.mat[,c(1,3)], U[,c(1,3)])      # scaling 1, axes 1 and 3
biplot(G.mat[,c(1,2)], U.sc[,c(1,2)])   # scaling 2, axes 1 and 2
biplot(G.mat[,c(1,3)], U.sc[,c(1,3)])   # scaling 2, axes 1 and 3

[Figure: Scaling 1 biplots, axes (1, 2) and (1, 3).]
[Figure: Scaling 2 biplots, axes (1, 2) and (1, 3).]

Data transformations

Transform physical variables (ecology) or characters (taxonomy):
- Univariate distributions are not symmetrical → apply a skewness-reduction transformation.
- Variables are not in the same physical units → apply standardization, z_i = (y_i - ȳ)/s_y, or ranging, y'_i = (y_i - y_min)/(y_max - y_min).

Transform community composition data (ecology; species presence-absence or abundance):
- Reduce the asymmetry of the distributions → apply a log(y + c) transformation.
- Make community composition data suitable for Euclidean-based ordination methods (PCA, RDA) → use the chord, chi-square, profile, or Hellinger transformations (Legendre & Gallagher 2001); a small sketch follows this list.
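A minimal base-R sketch of these transformations, using a small hypothetical abundance table Y (the vegan function decostand() offers several of them directly, e.g. methods "standardize", "range", "log", and "hellinger"):

# Hypothetical sites-by-species abundance table
Y <- matrix(c(1, 0, 4,
              0, 2, 8,
              3, 1, 2), nrow = 3, byrow = TRUE)

z      <- scale(Y)                              # standardization: (y - mean) / sd, by column
ranged <- apply(Y, 2, function(y) (y - min(y)) / (max(y) - min(y)))  # ranging to [0, 1]
logged <- log(Y + 1)                            # log(y + c) with c = 1

chord     <- Y / sqrt(rowSums(Y^2))             # chord transformation (rows scaled to length 1)
hellinger <- sqrt(Y / rowSums(Y))               # Hellinger transformation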

Some uses of principal component analysis (PCA)

Two-dimensional ordination of the objects:
- Sampling sites in ecology
- Individuals or taxa in taxonomy
A 2-dimensional ordination diagram is an interesting graphical support for representing other properties of multivariate data, e.g. clusters.

Detect outliers or erroneous data in data tables.

Find groups of variables that behave in the same way:
- Species in ecology
- Morphological/behavioural/molecular variables in taxonomy

Simplify (collinear) data; remove noise.

Remove an identifiable component of variation, e.g. the size factor in log-transformed morphological data.

Algebra of Correspondence Analysis

Frequency data table Y = [f_ij], with r rows and c columns, row sums f_i+, column sums f_+j, and grand total f_++.

Relative frequencies: p_ij = f_ij / f_++,  p_i+ = f_i+ / f_++,  p_+j = f_+j / f_++

Matrix Q = [q_ij], where

    q_ij = (p_ij - p_i+ p_+j) / sqrt(p_i+ p_+j) = (O_ij - E_ij) / sqrt(E_ij f_++)

with observed frequencies O_ij = f_ij and expected frequencies E_ij = f_i+ f_+j / f_++.

Cross-product matrix: Q'Q

Compute the eigenvalues and eigenvectors of Q'Q:

    (Q'Q - λk I) uk = 0

There are never more than k = min(r - 1, c - 1) eigenvalues > 0 in CA; they form the diagonal matrix Λ. The numerical example yields two positive eigenvalues, λ1 and λ2.

Matrix of eigenvectors of Q'Q (c × c):  U (c × k)
Matrix of eigenvectors of QQ' (r × r):  Û (r × k) = Q U Λ^(-1/2)

Compute matrices F and V for the scaling type 1 biplot, and V̂ and F̂ for the scaling type 2 biplot.

[Figure: CA biplots of the numerical example, scaling type 1 (left) and scaling type 2 (right), showing the sites (Site_1, Site_2, ...) and the species (Sp.1, ..., Sp.3) along the first two CA axes.]

Calculation details

Compute matrices V, V̂, F, and F̂ used in the ordination biplots:

    V (c × k) = D(p_+j)^(-1/2) U      where p_+j = f_+j / f_++
    V̂ (r × k) = D(p_i+)^(-1/2) Û      where p_i+ = f_i+ / f_++
    F (r × k) = V̂ Λ^(1/2)
    F̂ (c × k) = V Λ^(1/2)

Biplot, scaling type 1: plot F for sites and V for species. This projection preserves the chi-square distance among the sites. The sites are at the centroids (barycentres) of the species.

Biplot, scaling type 2: plot V̂ for sites and F̂ for species. This projection preserves the chi-square distance among the species. The species are at the centroids (barycentres) of the sites.
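The CA algebra above can be sketched in base R. The 3 × 3 frequency table below is hypothetical (it is not the table used in the slide's example):

# Hypothetical sites (rows) by species (columns) frequency table
Y.ca <- matrix(c(10, 10, 20,
                 10, 15,  5,
                 15,  5,  5), nrow = 3, byrow = TRUE)
f.tot <- sum(Y.ca)                 # f_++
P     <- Y.ca / f.tot              # p_ij
p.i   <- rowSums(P)                # p_i+
p.j   <- colSums(P)                # p_+j

# Matrix Q: (p_ij - p_i+ p_+j) / sqrt(p_i+ p_+j)
Q <- (P - p.i %o% p.j) / sqrt(p.i %o% p.j)

# Eigen-decomposition of Q'Q; at most min(r-1, c-1) positive eigenvalues
eig.ca <- eigen(t(Q) %*% Q)
k      <- 2                                        # min(r-1, c-1) for a 3 x 3 table
lambda <- eig.ca$values[1:k]
U      <- eig.ca$vectors[, 1:k]
Uhat   <- Q %*% U %*% diag(lambda^(-0.5))          # eigenvectors of QQ'

# Matrices used in the biplots
V    <- diag(p.j^(-0.5)) %*% U                     # species, scaling 1
Vhat <- diag(p.i^(-0.5)) %*% Uhat                  # sites, scaling 2
Fmat <- Vhat %*% diag(lambda^(0.5))                # sites, scaling 1 (matrix F)
Fhat <- V    %*% diag(lambda^(0.5))                # species, scaling 2 (matrix F-hat)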

Principal coordinate analysis (PCoA)

Example: a Euclidean distance matrix computed from the data of the PCA example.

    D =  0.000  3.162  3.162  7.071  7.071
         3.162  0.000  4.472  4.472  6.325
         3.162  4.472  0.000  6.325  4.472
         7.071  4.472  6.325  0.000  4.472
         7.071  6.325  4.472  4.472  0.000

Transform D into a new matrix A = [a_hi]:

    a_hi = -0.5 D_hi²

Centre matrix A to produce matrix Δ, whose row and column sums are equal to 0:

    Δ = [δ_hi] = [a_hi - ā_h - ā_i + ā]

    Δ =   12.8    4.8    4.8  -11.2  -11.2
           4.8    6.8   -3.2    0.8   -9.2
           4.8   -3.2    6.8   -9.2    0.8
         -11.2    0.8   -9.2   14.8    4.8
         -11.2   -9.2    0.8    4.8   14.8

Compute the eigenvalues and eigenvectors of matrix Δ. Scale the eigenvectors to lengths equal to the square roots of their respective eigenvalues, u'k uk = λk.

    Eigenvalues: λ1 = 36, λ2 = 20

    Objects   Eigenvectors
    x1        -3.578   0.000
    x2        -1.342   2.236
    x3        -1.342  -2.236
    x4         3.130   2.236
    x5         3.130  -2.236

    Eigenvector lengths: sqrt(36) = 6, sqrt(20) = 4.472

    PCoA eigenvalues / (n - 1) = 9, 5

Compare the PCoA eigenvalues and eigenvectors to the PCA eigenvalues and matrix F. PCoA is used for the ordination of distance matrices D produced by functions other than the Euclidean distance.
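A base-R sketch of this PCoA computation on the same distance matrix; the built-in function cmdscale() returns the same coordinates directly (possibly with reversed axis signs):

# PCoA by double-centring the transformed distance matrix
Y <- matrix(c(2, 3, 5, 7, 9,  1, 4, 0, 6, 2), nrow = 5)
D <- as.matrix(dist(Y))                  # Euclidean distances among the 5 objects

A     <- -0.5 * D^2                      # a_hi = -0.5 * D_hi^2
n     <- nrow(A)
J     <- diag(n) - matrix(1/n, n, n)     # centring matrix
Delta <- J %*% A %*% J                   # double-centred matrix (rows and columns sum to 0)

eig.pcoa <- eigen(Delta)
eig.pcoa$values[1:2]                     # 36 and 20; the remaining eigenvalues are ~0
coords <- eig.pcoa$vectors[, 1:2] %*% diag(sqrt(eig.pcoa$values[1:2]))
coords                                   # same as matrix F of the PCA example (up to sign)
cmdscale(D, k = 2)                       # check against the built-in implementation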