Lecture 4: Principal Component Analysis and Linear Dimension Reduction


Lecture 4: Principal Component Analysis and Linear Dimension Reduction
Advanced Applied Multivariate Analysis, STAT 2221, Fall 2013
Sungkyu Jung, Department of Statistics, University of Pittsburgh
E-mail: sungkyu@pitt.edu
http://www.stat.pitt.edu/sungkyu/aama/

Linear dimension reduction

For a random vector $X \in \mathbb{R}^p$, consider reducing the dimension from $p$ to $d$, i.e., from the $p$ variables $(X_1, \ldots, X_p)$ to a set of the $d$ most interesting variables, where $1 \le d \le p$. Best subset? Linear dimension reduction instead constructs $d$ variables $Z_1, \ldots, Z_d$ as linear combinations of $X_1, \ldots, X_p$, i.e.,
$$Z_i = a_{i1} X_1 + \cdots + a_{ip} X_p = a_i^\top X \quad (i = 1, \ldots, d).$$
Linear dimension reduction seeks a sequence of such $Z_i$, or equivalently a sequence of $a_i$, such that the random variables $Z_i$ are the most important among all possible choices.

Principal Component Analysis

In linear dimension reduction, we require $\|a_i\| = 1$ and $\langle a_i, a_j \rangle = 0$ for $i \neq j$. Thus the problem is to find an interesting set of direction vectors $\{a_i : i = 1, \ldots, p\}$, where the projection scores onto each $a_i$ are useful. Principal Component Analysis (PCA) is a linear dimension reduction technique that gives a set of direction vectors of maximal (projected) variance.

Take $d = 1$. PCA for the distribution of $X$ finds $a_1$ such that
$$a_1 = \operatorname*{argmax}_{a \in \mathbb{R}^p,\ \|a\| = 1} \mathrm{Var}(Z_1(a)),$$
where $Z_1(a) = a_1 X_1 + \cdots + a_p X_p = a^\top X$.

Geometric understanding of PCA for a point cloud

PCA is best understood with a point cloud. Take a look at a 2D example.

[Figure: n = 200 data points in 2D, axes v_1 and v_2.]

Geometric understanding of PCA for a point cloud

Take $a = (1, 0)$.

[Figure: the 2D point cloud and the density of the projection scores onto $a$; the variance of the scores is 0.79.]

Geometric understanding of PCA for a point cloud

Take $a = (1, 0)$, $(0, 1)$, and $(1/\sqrt{2}, 1/\sqrt{2})$. Which one is better?

[Figure: the point cloud and the projection score densities for the three directions, with score variances 0.79, 1.32, and 1.04.]

Geometric understanding of PCA for a 3D point cloud

A 3D point cloud, with its mean. The 1st PC direction (maximizing the variance of projections) explains the cloud best. The 1st and 2nd directions form a plane.

Formulation of population PCA (1)

Suppose a random vector $X$ has mean $\mu$ and covariance $\Sigma$ (not necessarily normal). The first principal component (PC) direction vector is the unit vector $u_1 \in \mathbb{R}^p$ that maximizes the variance of $u_1^\top X$ when compared to all other unit vectors, i.e.,
$$u_1 = \operatorname*{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1} \mathrm{Var}(u^\top X).$$
- $u_1 = (u_{11}, \ldots, u_{1p})^\top$ is the first PC direction vector, sometimes called the loading vector; $(u_{11}, \ldots, u_{1p})$ are the loadings of the 1st PC.
- $Z_1 = u_{11} X_1 + \cdots + u_{1p} X_p = u_1^\top X$ is the first PC score, or the first principal component (a random variable).
- $\lambda_1 = \mathrm{Var}(Z_1)$ is the variance explained by the first PC score.

Formulation of population PCA (2)

The second PC direction is the unit vector $u_2 \in \mathbb{R}^p$ that
- maximizes the variance of $u_2^\top X$;
- is orthogonal to the first PC direction $u_1$.
That is,
$$u_2 = \operatorname*{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1,\ u^\top u_1 = 0} \mathrm{Var}(u^\top X).$$
- $u_2 = (u_{21}, \ldots, u_{2p})^\top$ is the second PC direction vector, and is the vector of the 2nd loadings.
- $Z_2 = u_2^\top X$ is the second principal component.
- $\lambda_2 = \mathrm{Var}(Z_2)$ is the variance explained by the second PC score, and $\lambda_1 \ge \lambda_2$.
- $\mathrm{Corr}(Z_1, Z_2) = 0$.

Formulation of population PCA (3, 4, ..., p)

Given the first $k-1$ PC directions $u_1, \ldots, u_{k-1}$, the $k$th PC direction is the unit vector $u_k \in \mathbb{R}^p$ that
- maximizes the variance of $u_k^\top X$;
- is orthogonal to the first through $(k-1)$th PC directions $u_j$.
That is,
$$u_k = \operatorname*{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1,\ u^\top u_j = 0,\ j = 1, \ldots, k-1} \mathrm{Var}(u^\top X).$$
- $u_k = (u_{k1}, \ldots, u_{kp})^\top$ is the $k$th PC direction vector, and is the vector of the $k$th loadings.
- $Z_k = u_k^\top X$ is the $k$th principal component.
- $\lambda_k = \mathrm{Var}(Z_k)$ is the variance explained by the $k$th PC score, and $\lambda_1 \ge \cdots \ge \lambda_{k-1} \ge \lambda_k$.
- $\mathrm{Corr}(Z_i, Z_j) = 0$ for all $i \neq j \le k$.

Relation to the eigen-decomposition of $\Sigma$

Recall the eigen-decomposition of the symmetric positive definite $\Sigma = U \Lambda U^\top$ with
- $U = [u_1, \ldots, u_p]$ orthogonal,
- $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ with $\lambda_1 \ge \cdots \ge \lambda_p$,
- $\Sigma u_i = \lambda_i u_i$.
In the next two slides we show that:
1. The $k$th eigenvector $u_k$ is the $k$th PC direction vector.
2. The $k$th eigenvalue $\lambda_k$ is the variance explained by the $k$th principal component.
3. PC directions are orthogonal, $u_i^\top u_j = 0$ ($i \neq j$), and also $\Sigma$-orthogonal, $u_i^\top \Sigma u_j = 0$, equivalently $\mathrm{Cov}(Z_i, Z_j) = 0$ ($i \neq j$).
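
As a quick numerical check of this correspondence, a minimal R sketch (with an arbitrary, made-up covariance matrix Sigma) verifies that eigen() returns the PC directions and variances, and that no unit direction gives a larger projected variance than the first eigenvector:

    # Minimal sketch (R): eigenvectors of Sigma as PC directions
    set.seed(1)
    A <- matrix(rnorm(9), 3, 3)
    Sigma <- crossprod(A)                # an arbitrary symmetric positive definite matrix
    ed <- eigen(Sigma, symmetric = TRUE)
    U <- ed$vectors                      # columns u_1, ..., u_p (PC directions)
    lambda <- ed$values                  # lambda_1 >= ... >= lambda_p (PC variances)
    max(abs(Sigma %*% U[, 1] - lambda[1] * U[, 1]))   # Sigma u_1 = lambda_1 u_1

    # u' Sigma u for random unit vectors never exceeds lambda_1
    u <- matrix(rnorm(3 * 1000), 3)
    u <- sweep(u, 2, sqrt(colSums(u^2)), "/")         # normalize columns to unit length
    max(colSums(u * (Sigma %*% u)))                   # <= lambda[1]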

Relation to the eigen-decomposition of $\Sigma$

The first PC direction problem is to maximize $\mathrm{Var}(u^\top X)$ under the constraint $u^\top u = 1$. Using a Lagrange multiplier $\lambda$, the maximization is the same as finding a stationary point of
$$\Phi(u, \lambda) = \mathrm{Var}(u^\top X) - \lambda(u^\top u - 1) = u^\top \Sigma u - \lambda(u^\top u - 1).$$
The stationary point solves
$$\tfrac{1}{2} \frac{\partial}{\partial u} \Phi(u, \lambda) = \Sigma u - \lambda u = 0,$$
which leads to
$$\lambda = u^\top \Sigma u = \mathrm{Var}(u^\top X), \qquad \Sigma u = \lambda u. \tag{1}$$
The eigenvector-eigenvalue pairs $(u_i, \lambda_i)$, $i = 1, \ldots, p$, satisfy (1). It is clear that the first PC direction is the first eigenvector $u_1$, as it gives the largest variance $\lambda_1 = u_1^\top \Sigma u_1 = \mathrm{Var}(u_1^\top X) \ge \lambda_j$ $(j > 1)$.

Relation to the eigen-decomposition of $\Sigma$

For the $k$th PC direction, given the first $k-1$ PC directions, we form the Lagrangian
$$\Phi(u, \lambda, \gamma_1^{k-1}) = u^\top \Sigma u - \lambda(u^\top u - 1) - \sum_{j=1}^{k-1} 2\gamma_j\, u_j^\top u.$$
The derivatives of $\Phi$, equated to zero, are
$$\tfrac{1}{2} \frac{\partial}{\partial u} \Phi(u, \lambda, \gamma_1^{k-1}) = \Sigma u - \lambda u - \sum_{j=1}^{k-1} \gamma_j u_j = 0, \qquad \frac{\partial}{\partial \gamma_j} \Phi(u, \lambda, \gamma_1^{k-1}) = u_j^\top u = 0. \tag{2}$$
We have $\gamma_j = u_j^\top \Sigma u = 0$ (since $\Sigma u_j = \lambda_j u_j$ and $u_j^\top u = 0$), thus
$$\lambda = u^\top \Sigma u = \mathrm{Var}(u^\top X), \qquad \Sigma u = \lambda u. \tag{3}$$
The $k$th through last eigen-pairs $(u_i, \lambda_i)$, $i = k, \ldots, p$, satisfy both (2) and (3). Thus the $k$th PC direction is $u_k$, as it gives the largest remaining variance $\lambda_k = u_k^\top \Sigma u_k$.

Example: principal components of a bivariate normal distribution

Consider $N_2(0, \Sigma)$ with
$$\Sigma = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix}
= \begin{bmatrix} \cos(\pi/4) & -\sin(\pi/4) \\ \sin(\pi/4) & \cos(\pi/4) \end{bmatrix}
\begin{bmatrix} 3 & 0 \\ 0 & 1 \end{bmatrix}
\begin{bmatrix} \cos(\pi/4) & \sin(\pi/4) \\ -\sin(\pi/4) & \cos(\pi/4) \end{bmatrix}.$$
The PC directions $u_i$ are the principal axes of the ellipsoids representing the density of the MVN, with lengths proportional to $\sqrt{\lambda_i}$.
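
A small R check of this example: eigen() applied to this $\Sigma$ should recover the variances 3 and 1 and the 45-degree PC directions, and simulated data (sample size chosen arbitrarily) give nearly the same answer.

    # Minimal sketch (R): PCA of the bivariate normal example above
    Sigma <- matrix(c(2, 1,
                      1, 2), nrow = 2, byrow = TRUE)
    ed <- eigen(Sigma, symmetric = TRUE)
    ed$values       # 3 1: variances explained by PC1 and PC2
    ed$vectors      # first column is (1, 1)/sqrt(2), the 45-degree direction (up to sign)

    # Simulated data from N_2(0, Sigma)
    set.seed(1)
    cholS <- chol(Sigma)                              # Sigma = t(cholS) %*% cholS
    x <- matrix(rnorm(2 * 1000), 1000, 2) %*% cholS   # n x p data matrix
    eigen(cov(x), symmetric = TRUE)$values            # approximately 3 and 1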

Sample PCA

Given multivariate data $\mathbb{X} = [x_1, \ldots, x_n]_{p \times n}$, sample PCA sequentially finds orthogonal directions of maximal (projected) sample variance.
1. For $\tilde{x}_i = x_i - \bar{x}$, the centered data matrix is $\tilde{\mathbb{X}} = [\tilde{x}_1, \ldots, \tilde{x}_n] = \mathbb{X}(I_n - \tfrac{1}{n} \mathbf{1}\mathbf{1}^\top)$.
2. The sample variance matrix is then $S = \tfrac{1}{n-1} \tilde{\mathbb{X}} \tilde{\mathbb{X}}^\top$.
3. The eigen-decomposition $S = \hat{U} \hat{\Lambda} \hat{U}^\top$ gives the $k$th sample PC direction $\hat{u}_k$ and the variance of the $k$th sample score $\hat{\lambda}_k$.
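
A minimal R sketch of steps 1-3 on made-up data (note that the slides use a p x n data matrix, whereas R functions such as prcomp expect n x p):

    # Sample PCA by hand, following steps 1-3 above
    set.seed(1)
    p <- 4; n <- 50
    X <- matrix(rnorm(p * n), p, n)                 # made-up p x n data
    Xc <- X %*% (diag(n) - matrix(1 / n, n, n))     # step 1: centered data matrix
    S  <- Xc %*% t(Xc) / (n - 1)                    # step 2: sample variance matrix
    ed <- eigen(S, symmetric = TRUE)
    Uhat      <- ed$vectors                         # step 3: sample PC directions
    lambdahat <- ed$values                          # variances of the sample scores

    # same variances (up to numerical error) from prcomp on the n x p layout
    all.equal(lambdahat, unname(prcomp(t(X))$sdev^2))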

Sample PCA

It can be checked that the eigenvector $\hat{u}_k$ of $S$ satisfies
$$\hat{u}_k = \operatorname*{argmax}_{u \in \mathbb{R}^p,\ \|u\| = 1,\ u^\top \hat{u}_j = 0,\ j = 1, \ldots, k-1} \mathrm{Var}(\{z_{(k)i}(u) : i = 1, \ldots, n\}), \quad \text{where } z_{(k)i}(u) = u^\top \tilde{x}_i \ (i = 1, \ldots, n).$$
- $\hat{u}_k = (\hat{u}_{k1}, \ldots, \hat{u}_{kp})^\top$ is the $k$th sample PC direction vector, and is the vector of the $k$th loadings.
- $z_{(k)} = (z_{(k)1}, \ldots, z_{(k)n})_{1 \times n} = \hat{u}_k^\top \tilde{\mathbb{X}}$ is the $k$th score vector, where $z_{(k)i} = \hat{u}_k^\top \tilde{x}_i$.
- $\hat{\lambda}_k$ is the sample variance of $\{z_{(k)i} : i = 1, \ldots, n\}$.
The information in the first two principal components is contained in the score vectors $(z_{(1)}, z_{(2)})$, whose 2D scatterplot is thus the most informative one, having the largest total variance.

Example: Swiss Bank Note Data

Swiss Bank Note data from HS. $p = 6$, $n = 200$. Genuine (blue) and fake (red) samples in the original measurements.

[Figure: scatterplot matrix of the six measurements V1-V6.]

Example: Swiss Bank Note Data

Scatterplots of the principal components.

[Figure: scatterplot matrix of the scores Comp.1-Comp.6.]

How many components to keep? (1)

The criterion for PCA is high variance in the principal components. The question is how much of the whole data the PCs explain in terms of variance.
1. The total variance in $\mathbb{X}$ is the sum of all marginal variances,
$$\sum_{k=1}^{p} \mathrm{Var}(\{x_{ki} : i = 1, \ldots, n\}) = \operatorname{tr}(S) = \sum_{k=1}^{p} \mathrm{Var}(\{z_{(k)i} : i = 1, \ldots, n\}) = \sum_{k=1}^{p} \lambda_k.$$
2. The variance of the $k$th scores: $\mathrm{Var}(\{z_{(k)i} : i = 1, \ldots, n\}) = \lambda_k$.
3. The total variance in the 1st through $k$th scores: $\lambda_1 + \cdots + \lambda_k$.
For heuristics based on these quantities, see the next slide.

Example: Swiss Bank Note Data, scree plot

In the scree plot of $(k, \lambda_k)$, we look for an elbow. In the cumulative scree plot of the proportion of variance explained, $\bigl(k, \sum_{j=1}^{k} \lambda_j / \sum_{j=1}^{p} \lambda_j\bigr)$, use 90% as a cutoff.

[Figure: scree plot and cumulative scree plot for the six components.]
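
The scree quantities are easy to compute from the princomp output used elsewhere in these slides; a minimal sketch, with placeholder data standing in for the bank notes, is:

    # Scree and cumulative scree quantities
    x <- as.data.frame(matrix(rnorm(200 * 6), 200, 6))   # placeholder data
    spr <- princomp(x)
    lambda <- spr$sdev^2                     # variances of the PC scores
    plot(lambda, type = "b", xlab = "component", ylab = "lambda")   # scree plot
    cumprop <- cumsum(lambda) / sum(lambda)  # cumulative proportion of variance
    plot(100 * cumprop, type = "b", xlab = "component",
         ylab = "cumulative proportion (%)")                        # cumulative scree plot
    which(cumprop >= 0.90)[1]                # smallest k reaching the 90% cutoff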

How many components to keep? (2)

1. Kaiser's rule (of thumb): retain PCs $1, \ldots, k$ satisfying
$$\lambda_k > \bar{\lambda} = \frac{1}{p} \sum_{j=1}^{p} \lambda_j.$$
This tends to choose fewer components.
2. Likelihood ratio testing of the null hypothesis $H_0(k) : \lambda_{k+1} = \cdots = \lambda_p$, where $k$ components are used if $H_0(k)$ is not rejected at a specified level.
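
Kaiser's rule is a one-liner given the variances lambda from the scree sketch above:

    # Kaiser's rule: number of PCs whose variance exceeds the average variance
    k_kaiser <- sum(lambda > mean(lambda))
    k_kaiser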

Which variables are most responsible for the principal components?

- Loadings of the principal component directions.
- Biplot: a scatterplot of the PC1 and PC2 scores, overlaid with $p$ vectors, each representing the loadings of one variable on the first two PC directions.

In the Swiss Bank Note Data, the loadings as printed by R are below (columns Comp.1 to Comp.6; the default printout leaves loadings below 0.1 in absolute value blank, so each row lists only the larger loadings of that variable, in component order):

    V1  -0.326   0.562   0.753
    V2   0.112  -0.259   0.455  -0.347  -0.767
    V3   0.139  -0.345   0.415  -0.535   0.632
    V4   0.768  -0.563  -0.218  -0.186
    V5   0.202   0.659  -0.557  -0.451   0.102
    V6  -0.579  -0.489  -0.592  -0.258

Example: Swiss Bank Note Data, biplot

Which measurements are most responsible for the first two principal components?

[Figure: biplot of the Comp.1 and Comp.2 scores for the 200 notes, overlaid with the loading vectors of V1-V6, together with plots of the PC1 and PC2 loadings.]

Computation of PCA

PCA is computed either using the eigen-decomposition of $S = \tfrac{1}{n-1} \tilde{\mathbb{X}} \tilde{\mathbb{X}}^\top$ or using the singular value decomposition of $\tilde{\mathbb{X}}$.

Eigen-decomposition of $S$. For $S = U \Lambda U^\top$:
1. PC directions: $u_k$ (eigenvectors).
2. Variances of the PC scores: $\lambda_k$ (eigenvalues).
3. Matrix of principal component scores:
$$Z = U^\top \tilde{\mathbb{X}} = \begin{bmatrix} z_{(1)} \\ \vdots \\ z_{(p)} \end{bmatrix}.$$

Computation of PCA, SVD of $\tilde{\mathbb{X}}$

The singular value decomposition (SVD) of the $p \times n$ matrix $\tilde{\mathbb{X}}$ has the form $\tilde{\mathbb{X}} = U D V^\top$.
- The left singular vectors $U = [u_1, \ldots, u_p]_{p \times p}$ and the right singular vectors $V = [v_1, \ldots, v_p]_{n \times p}$ are orthogonal ($U^\top U = I_p$, $V^\top V = I_p$).
- The columns of $U$ span the column space of $\tilde{\mathbb{X}}$; the columns of $V$ span the row space.
- $D = \mathrm{diag}(d_1, \ldots, d_p)$, where $d_1 \ge d_2 \ge \cdots \ge d_p \ge 0$ are the singular values of $\tilde{\mathbb{X}}$.

Computation of PCA, SVD of $\tilde{\mathbb{X}}$

With the SVD $\tilde{\mathbb{X}} = U D V^\top$, we have
$$S = \tfrac{1}{n-1} \tilde{\mathbb{X}} \tilde{\mathbb{X}}^\top = \tfrac{1}{n-1} U D V^\top V D U^\top = U\, \mathrm{diag}\!\left(\tfrac{d_j^2}{n-1}\right) U^\top.$$
1. PC directions: $u_k$ (left singular vectors).
2. Variances of the PC scores: $\tfrac{1}{n-1} d_j^2$ (scaled squared singular values).
3. Matrix of principal component scores (from the right singular vectors):
$$Z = U^\top \tilde{\mathbb{X}} = D V^\top = \begin{bmatrix} z_{(1)} \\ \vdots \\ z_{(p)} \end{bmatrix} = \begin{bmatrix} d_1 v_1^\top \\ \vdots \\ d_p v_p^\top \end{bmatrix}.$$
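
Continuing the sample-PCA sketch above (Xc, n, Uhat, lambdahat as defined there), the SVD route gives the same directions and variances:

    # PCA via the SVD of the centered data matrix
    sv <- svd(Xc)                          # Xc = U D V'
    U_svd      <- sv$u                     # PC directions (left singular vectors)
    lambda_svd <- sv$d^2 / (n - 1)         # variances of the PC scores
    Z <- diag(sv$d) %*% t(sv$v)            # p x n score matrix, rows z_(1), ..., z_(p)

    all.equal(lambda_svd, lambdahat)       # same variances as the eigen route
    all.equal(abs(U_svd), abs(Uhat))       # same directions, up to sign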

PCA in R

The standard data format is an $n \times p$ data frame or matrix x.

To perform PCA by eigen-decomposition:

    spr <- princomp(x)
    U <- spr$loadings
    L <- (spr$sdev)^2
    Z <- spr$scores

To perform PCA by singular value decomposition:

    gpr <- prcomp(x)
    U <- gpr$rotation
    L <- (gpr$sdev)^2
    Z <- gpr$x
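
The two routes should agree up to sign and a divisor: princomp uses divisor n for the covariance matrix while prcomp uses n - 1. A quick check on made-up data:

    x <- matrix(rnorm(200 * 6), 200, 6)    # placeholder data
    n <- nrow(x)
    all.equal(n / (n - 1) * princomp(x)$sdev^2,
              prcomp(x)$sdev^2, check.attributes = FALSE)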

Correlation PCA

PCA is not scale invariant. The correlation matrix of a random vector $X$ is given by
$$R = D_\Sigma^{-1/2} \Sigma D_\Sigma^{-1/2},$$
where $D_\Sigma$ is the $p \times p$ diagonal matrix consisting of the diagonal elements of $\Sigma$. Correlation PCA: the PC directions and the variances of the PC scores are obtained from the eigen-decomposition $R = U_R \Lambda_R U_R^\top$. Preferred if the measurements are not commensurate (e.g., $X_1$ = household income, $X_2$ = years in school).
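
A minimal sketch of correlation PCA in R, assuming an n x p matrix or data frame x as on the previous slide:

    spr_cor <- princomp(x, cor = TRUE)   # PCA of the correlation matrix
    gpr_cor <- prcomp(x, scale. = TRUE)  # equivalent: standardize, then PCA
    eigen(cor(x))$values                 # variances of the correlation-PC scores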

Olivetti Faces data

PCA for the Olivetti Faces data, obtained from http://www.cs.nyu.edu/~roweis/data.html. Grayscale faces, 8-bit [0-255], a few (10) images of each of several (40) different people; 400 images in total, each of size 64 x 64. From the Olivetti database at AT&T.

[Figure: 4 sample faces, 64 x 64 pixels; each pixel has a value 0-255 (darker to lighter).]

Images as data

[Figure: sample 1, 64 x 64 pixels; each pixel has a value 0-255 (darker to lighter).]

An image is a matrix-valued datum. In the Olivetti Faces data, the matrix is of size 64 x 64, with each pixel taking values in [0-255]. The matrix corresponding to one observation is vectorized (vec'd) by stacking its columns into one long vector of size $d = 4096 = 64 \times 64$. So $x_1$ is a $d \times 1$ vector corresponding to the first image, and PCA is applied to the data matrix $\mathbb{X} = [x_1, \ldots, x_n]$.

Olivetti Faces data: major components

[Figure: pairwise scatterplots of the scores along the first four PC directions, with score densities on the diagonal.]

Olivetti Faces data: scree plots

[Figure: scree plot and cumulative scree plot, over all 400 components and zoomed in on the first 20.]

Olivetti Faces data: interpretation

Examine the modes of variation by walking along each PC direction from the mean; here, by $\pm 1, \pm 2$ standard deviations of $Z_{(k)}$.

[Figure: scatterplot of the PC1 and PC2 scores, with observation indices as plotting labels.]

Olivetti Faces data: interpretation (Eigenfaces)

PC walk: $\pm 2$ standard deviations along each PC direction (top: PC 1, middle: PC 2, bottom: PC 3).
- PC1: darker to lighter face
- PC2: feminine to masculine face
- PC3: oval to rectangular face

Olivetti Faces data: interpretation (Eigenfaces computation)

Vectorized data in $\mathbb{X} = [x_1, \ldots, x_n]$, with mean $\bar{x}$. Denote by $(u_j, \lambda_j)$ the $j$th PC direction and PC variance. Walk along the $j$th PC direction and examine positions $s = 0, \pm 1, \pm 2$ by:
1. Reconstruction at $s$: $w_s = \bar{x} + s \sqrt{\lambda_j}\, u_j$.
2. Converting to an image by reshaping the $4096 \times 1$ vector $w_s$ into the $64 \times 64$ matrix $W_s$.
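
A minimal R sketch of this PC walk, assuming X is the 4096 x n matrix of vectorized faces (variable names are illustrative):

    # Walk 0, +/-1, +/-2 standard deviations along PC direction j
    n <- ncol(X)
    xbar <- rowMeans(X)
    Xc <- X - xbar                        # center each face (column)
    sv <- svd(Xc)                         # PCA via SVD; directions in sv$u
    j <- 1
    u_j <- sv$u[, j]
    lambda_j <- sv$d[j]^2 / (n - 1)       # variance of the jth PC score
    op <- par(mfrow = c(1, 5))
    for (s in -2:2) {
      w_s <- xbar + s * sqrt(lambda_j) * u_j     # reconstruction at position s
      W_s <- matrix(w_s, nrow = 64, ncol = 64)   # reshape to a 64 x 64 image
      image(W_s, col = gray.colors(256), axes = FALSE)
    }
    par(op)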

Olivetti Faces data: reconstruction of the original data

Representation of the original data:
$$x_i = \bar{x} + \sum_{j=1}^{p} z_{(j)i} u_j \quad (i = 1, \ldots, n).$$
Approximation of the original observation $x_i$ by the first $m < p$ principal components:
$$\hat{x}_i = \bar{x} + \sum_{j=1}^{m} z_{(j)i} u_j.$$
The larger $m$, the better the approximation $\hat{x}_i$; the smaller $m$, the more succinct the dimension reduction of $\mathbb{X}$.
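
Continuing the sketch above (xbar, Xc, sv as defined there), a rank-m reconstruction of observation i takes a few lines:

    # xhat_i = xbar + sum_{j=1}^m z_(j)i u_j
    reconstruct <- function(i, m) {
      U <- sv$u[, 1:m, drop = FALSE]      # first m PC directions
      z <- crossprod(U, Xc[, i])          # scores z_(1)i, ..., z_(m)i
      xbar + as.vector(U %*% z)
    }
    xhat <- reconstruct(i = 5, m = 50)    # e.g., the 5th face with 50 PCs
    image(matrix(xhat, 64, 64), col = gray.colors(256), axes = FALSE)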

Reconstruction of the original face: observation index i = 5

[Figure: the 5th face reconstructed, from top left to bottom right, using the mean and the first 1, 5, 10, 20, 50, 100, and 400 PCs.]

Reconstruction of the original face: observation index i = 19

[Figure: the 19th face reconstructed, from top left to bottom right, using the mean and the first 1, 5, 10, 20, 50, 100, and 400 PCs.]

Reconstruction of the original face: observation index i = 100

[Figure: the 100th face reconstructed, from top left to bottom right, using the mean and the first 1, 5, 10, 20, 50, 100, and 400 PCs.]

Olivetti Faces data: reconstruction of the original face

- Human eyes require > 50 principal components to see the resemblance between $\hat{x}_i$ and $x_i$. This corresponds to about 90% of the variance explained by the PCs.
- The decision on how many components to use is subjective and heuristic.
- Reconstruction by PCA is most useful when each datum is visually represented (rather than being just numbers), for example when the data objects are images, functions, or shapes.

Dimension reduction for two sets of random variables

When the distinction between explanatory and response variables is not so clear, an analysis dealing with the two sets of variables in a symmetric manner is desired. Examples:
1. The relationship between gene expressions and biological variables;
2. The relationship between two sets of psychological tests, each with multidimensional measurements.

Example: nutrimouse dataset, a nutrition study in the mouse

Reference: http://www6.toulouse.inra.fr/toxalim/pharmaco-moleculaire/acceuil.html, Pascal Martin from the Toxicology and Pharmacology Laboratory (French National Institute for Agronomic Research). Obtained from the R package CCA.

$n = 40$ mice, each with $r = 120$ gene expression levels related to the nutrition problem, and $s = 21$ measurements of the concentrations of 21 hepatic fatty acids. $\mathbb{X}$ is $120 \times 40$, $\mathbb{Y}$ is $21 \times 40$.

Example: nutrimouse dataset, objective

Focus on the first 10 gene expression levels (the dimension $r$ of the first set of variables is now 10), so $\mathbb{X}$ is $10 \times 40$ and $\mathbb{Y}$ is $21 \times 40$.

[Figure: heat map of the XY correlations.]

Example: nutrimouse dataset, objective

Interested in dimension reduction of $X$ and $Y$, while keeping the important association between $X$ and $Y$: find a linear dimension reduction of $X$ and $Y$ using the cross-correlation matrix.

[Figure: heat maps of the X correlations, Y correlations, and cross-correlations.]

Canonical Correlation Analysis (CCA)

CCA seeks to identify and quantify the associations between two sets of variables. Given two random vectors $X \in \mathbb{R}^r$ and $Y \in \mathbb{R}^s$ ($r$ and $s$ need not be equal), consider a linear dimension reduction of each of the two random vectors,
$$\xi = g^\top X = g_1 X_1 + \cdots + g_r X_r, \qquad \omega = h^\top Y = h_1 Y_1 + \cdots + h_s Y_s.$$
CCA finds the random variables $(\xi, \omega)$, or the projection vectors $(g, h)$, that give maximal correlation between $\xi$ and $\omega$,
$$\mathrm{Corr}(\xi, \omega) = \frac{\mathrm{Cov}(\xi, \omega)}{\sqrt{\mathrm{Var}(\xi)\,\mathrm{Var}(\omega)}}.$$

Population CCA (1)

Denote $\Sigma_{11} = \mathrm{Cov}(X)$, $\Sigma_{22} = \mathrm{Cov}(Y)$, $\Sigma_{12} = \mathrm{Cov}(X, Y)$, $\Sigma_{21} = \mathrm{Cov}(Y, X)$. The first set of projection vectors is
$$(g_1, h_1) = \operatorname*{argmax}_{g \in \mathbb{R}^r,\, h \in \mathbb{R}^s} \mathrm{Corr}(g^\top X, h^\top Y) \tag{4}$$
$$= \operatorname*{argmax}_{g \in \mathbb{R}^r,\, h \in \mathbb{R}^s} \frac{g^\top \Sigma_{12} h}{\sqrt{g^\top \Sigma_{11} g \; h^\top \Sigma_{22} h}}. \tag{5}$$
- $(g_1, h_1)$: the canonical correlation vectors.
- $(\xi_1 = g_1^\top X,\ \omega_1 = h_1^\top Y)$ are called the canonical variates.
- $\rho_1 = \mathrm{Corr}(\xi_1, \omega_1)$ is the largest canonical correlation.

Population CCA (2, 3, ...)

The $k$th set of projection vectors, given $(g_1, \ldots, g_{k-1})$ and $(h_1, \ldots, h_{k-1})$, is
$$(g_k, h_k) = \operatorname*{argmax}_{\substack{g \in \mathbb{R}^r,\, h \in \mathbb{R}^s \\ g^\top \Sigma_{11} g_j = 0,\ h^\top \Sigma_{22} h_j = 0,\ j = 1, \ldots, k-1}} \frac{g^\top \Sigma_{12} h}{\sqrt{g^\top \Sigma_{11} g \; h^\top \Sigma_{22} h}}. \tag{6}$$
- $(g_k, h_k)$: the $k$th canonical correlation vectors.
- $(\xi_k = g_k^\top X,\ \omega_k = h_k^\top Y)$: the $k$th canonical variates.
- $\rho_k = \mathrm{Corr}(\xi_k, \omega_k) \le \rho_j$, $j = 1, \ldots, k-1$.
- $\mathrm{Corr}(\xi_k, \xi_j) = 0$ and $\mathrm{Corr}(\omega_k, \omega_j) = 0$, $j = 1, \ldots, k-1$.
- In general, $g_i^\top g_j = 0$ does NOT hold.

Sample CCA

For a sample $(x_i, y_i)$, $i = 1, \ldots, n$, denote $S_{11} = \widehat{\mathrm{Cov}}(X)$, $S_{22} = \widehat{\mathrm{Cov}}(Y)$, $S_{12} = \widehat{\mathrm{Cov}}(X, Y)$, $S_{21} = \widehat{\mathrm{Cov}}(Y, X)$. The sample CCA is
$$(g_k, h_k) = \operatorname*{argmax}_{\substack{g \in \mathbb{R}^r,\, h \in \mathbb{R}^s \\ g^\top S_{11} g_j = 0,\ h^\top S_{22} h_j = 0,\ j = 1, \ldots, k-1}} \frac{g^\top S_{12} h}{\sqrt{g^\top S_{11} g \; h^\top S_{22} h}}. \tag{7}$$

Finding the CCA solution

First CC vectors: maximize
$$\frac{g^\top S_{12} h}{\sqrt{g^\top S_{11} g \; h^\top S_{22} h}}.$$
Change of variables: $a = S_{11}^{1/2} g$, $b = S_{22}^{1/2} h$. The problem is now to maximize
$$\frac{a^\top S_{11}^{-1/2} S_{12} S_{22}^{-1/2} b}{\sqrt{a^\top a \; b^\top b}}$$
with respect to $a \in \mathbb{R}^r$, $b \in \mathbb{R}^s$, and the solution is given by
$$a_1 = v_1\!\left(S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}\right), \qquad b_1 = v_1\!\left(S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2}\right),$$
where $v_i(M)$ is the $i$th eigenvector (corresponding to the $i$th largest eigenvalue) of a symmetric matrix $M$.

Finding the CCA solution

For $j = 1, \ldots, \min(s, r)$, we have
$$a_j = v_j\!\left(S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}\right), \qquad b_j = v_j\!\left(S_{22}^{-1/2} S_{21} S_{11}^{-1} S_{12} S_{22}^{-1/2}\right),$$
$$g_j = S_{11}^{-1/2} a_j, \qquad h_j = S_{22}^{-1/2} b_j,$$
and thus the canonical variate scores $\xi_j = g_j^\top \mathbb{X} = (g_j^\top x_1, \ldots, g_j^\top x_n)$ and $\omega_j = h_j^\top \mathbb{Y} = (h_j^\top y_1, \ldots, h_j^\top y_n)$. Note that $g_j^\top g_k = a_j^\top S_{11}^{-1} a_k \neq 0$ in general (not orthogonal), while $\mathrm{Cov}(\{\xi_{j(i)}\}, \{\xi_{k(i)}\}) = g_j^\top S_{11} g_k = 0$ for $j \neq k$ (uncorrelated).
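
A minimal R sketch of this recipe on made-up data, checked against cancor from the stats package (variable and function names below are illustrative):

    set.seed(1)
    n <- 40; r <- 5; s <- 3
    x <- matrix(rnorm(n * r), n, r)       # n x r
    y <- matrix(rnorm(n * s), n, s)       # n x s
    S11 <- cov(x); S22 <- cov(y); S12 <- cov(x, y); S21 <- t(S12)

    inv_sqrt <- function(S) {             # inverse symmetric square root S^(-1/2)
      e <- eigen(S, symmetric = TRUE)
      e$vectors %*% diag(1 / sqrt(e$values), nrow(S)) %*% t(e$vectors)
    }
    M <- inv_sqrt(S11) %*% S12 %*% solve(S22) %*% S21 %*% inv_sqrt(S11)
    ed <- eigen(M, symmetric = TRUE)
    rho <- sqrt(pmax(ed$values[1:min(r, s)], 0))   # canonical correlations
    g1 <- inv_sqrt(S11) %*% ed$vectors[, 1]        # first canonical correlation vector for X

    rho
    cancor(x, y)$cor                               # should match rho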

nutrimouse data: canonical correlation coefficients

Reduced dataset: $r = 10$ and $s = 21$ variables with $n = 40$ samples, giving $\min(s, r) = 10$ pairs of canonical variates with decreasing canonical correlation coefficients.

[Figure: plot of res.cc$cor, the 10 canonical correlation coefficients.]

nutrimouse data: canonical variate scores plot

[Figure: scatterplots of the first four pairs of canonical variates $\{(\xi_{(j)i}, \omega_{(j)i}) : i = 1, \ldots, n\}$.]

nutrimouse data with r = 120

The original data, with $r > n$. Problem?

[Figure: heat map of the XY correlations.]

Regularized CCA

When $r > n$ or $s > n$, there always exist
$$(g_1, h_1) = \operatorname*{argmax}_{g \in \mathbb{R}^r,\, h \in \mathbb{R}^s} \frac{g^\top S_{12} h}{\sqrt{g^\top S_{11} g \; h^\top S_{22} h}}$$
satisfying $\mathrm{Corr}(\xi_j, \omega_j) = \pm 1$ (perfect correlation!). This is an artifact of $S_{11}$ and $S_{22}$ not being invertible (the ranks of $\tilde{\mathbb{X}}$ and $\tilde{\mathbb{Y}}$ are at most $n - 1$). To circumvent this issue and obtain a meaningful estimate of the population canonical correlation coefficients, regularized CCA is often used:
$$(g_k, h_k) = \operatorname*{argmax}_{\substack{g \in \mathbb{R}^r,\, h \in \mathbb{R}^s \\ g^\top S_{11} g_j = 0,\ h^\top S_{22} h_j = 0,\ j = 1, \ldots, k-1}} \frac{g^\top S_{12} h}{\sqrt{g^\top (S_{11} + \lambda_1 I_r) g \; h^\top (S_{22} + \lambda_2 I_s) h}} \tag{8}$$
for $\lambda_1, \lambda_2 > 0$.
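
One simple way to sketch this regularization in R is to plug the ridged matrices into the recipe from the CCA sketch above (S11, S22, S12, S21, inv_sqrt, x, y as defined there); the rcc function of the CCA package, shown on the CCA in R slide below, provides the full implementation, and the values of lambda1 and lambda2 here are illustrative only:

    lambda1 <- 0.1; lambda2 <- 0.1
    S11r <- S11 + lambda1 * diag(ncol(x))
    S22r <- S22 + lambda2 * diag(ncol(y))
    Mr <- inv_sqrt(S11r) %*% S12 %*% solve(S22r) %*% S21 %*% inv_sqrt(S11r)
    rho_reg <- sqrt(pmax(eigen(Mr, symmetric = TRUE)$values[1:min(ncol(x), ncol(y))], 0))
    rho_reg                               # regularized canonical correlations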

nutrimouse data: regularized CCA

1. The sample canonical correlation coefficients $\rho_j$ and
2. the corresponding projection vectors $h_j$ and $g_j$
vary depending on the choice of the regularization parameters $\lambda_1 = \lambda_2 = 0.0001, 0.01, 1$.

CCA in R

The data are an $n \times r$ data frame or matrix x and an $n \times s$ data frame or matrix y.

To perform CCA:

    cc <- cancor(x, y)
    rho <- cc$cor
    g <- cc$xcoef
    h <- cc$ycoef

To perform regularized CCA, use the package CCA:

    library(CCA)
    cc <- rcc(x, y, lambda1, lambda2)
    rho <- cc$cor
    g <- cc$xcoef
    h <- cc$ycoef