Covariance and Correlation Matrix

Given a sample $\{x_n\}_{n=1}^N$, where $x_n \in \mathbb{R}^d$, $x_n = (x_{1n}, x_{2n}, \ldots, x_{dn})^T$:

- sample mean $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$, with entries $\bar{x}_i = \frac{1}{N} \sum_{n=1}^N x_{in}$
- sample covariance matrix: a $d \times d$ matrix $Z$ with entries $Z_{ij} = \frac{1}{N-1} \sum_{n=1}^N (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)$
- sample correlation matrix: a $d \times d$ matrix $C$ with entries $C_{ij} = \frac{1}{N-1} \frac{\sum_{n=1}^N (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)}{\sigma_{x_i} \sigma_{x_j}}$, where $\sigma_{x_i}$ and $\sigma_{x_j}$ are the sample standard deviations

Covariance and Correlation Matrix Example

Given the sample $x_1 = \begin{pmatrix} 1.2 \\ 0.9 \end{pmatrix}$, $x_2 = \begin{pmatrix} 2.5 \\ 3.9 \end{pmatrix}$, $x_3 = \begin{pmatrix} 0.7 \\ 0.4 \end{pmatrix}$, $x_4 = \begin{pmatrix} 4.2 \\ 5.8 \end{pmatrix}$, with mean $\bar{x} = \begin{pmatrix} 2.15 \\ 2.75 \end{pmatrix}$ and sample standard deviations $\sigma_{x_1} = 1.563117$, $\sigma_{x_2} = 2.554082$:

$Z = \begin{pmatrix} 2.443333 & 3.940000 \\ 3.940000 & 6.523333 \end{pmatrix}$

$C = \begin{pmatrix} \frac{2.443333}{1.563117 \cdot 1.563117} & \frac{3.940000}{1.563117 \cdot 2.554082} \\ \frac{3.940000}{2.554082 \cdot 1.563117} & \frac{6.523333}{2.554082 \cdot 2.554082} \end{pmatrix} = \begin{pmatrix} 1.000000 & 0.986893 \\ 0.986893 & 1.000000 \end{pmatrix}$

Observe: if the sample is z-normalized ($x_{ij}^{\text{new}} = \frac{x_{ij} - \bar{x}_i}{\sigma_{x_i}}$, mean 0, standard deviation 1), then C equals Z. See cov(), cor(), scale() in R.
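As a quick check, the example can be reproduced with the R functions named above; a minimal sketch, assuming the data matrix X holds one sample per row:

# Minimal sketch: reproduce the example with base R (one sample per row).
X <- matrix(c(1.2, 0.9,
              2.5, 3.9,
              0.7, 0.4,
              4.2, 5.8), ncol = 2, byrow = TRUE)

colMeans(X)    # sample mean (2.15, 2.75)
cov(X)         # sample covariance matrix Z (1/(N-1) normalization)
cor(X)         # sample correlation matrix C
cov(scale(X))  # covariance of the z-normalized sample equals cor(X)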

Principal Component Analysis with NN

Principal Component Analysis (PCA) is a technique for

- dimensionality reduction
- lossy data compression
- feature extraction
- data visualization

Idea: orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

(Figure: 2-D data with the principal direction $u_1$.)

Maximize Variance of Projected Data

Given data $\{x_n\}_{n=1}^N$ where $x_n$ has dimensionality $d$.

Goal: project the data onto a space having dimensionality $m < d$ while maximizing the variance of the projected data.

Let us consider the projection onto a one-dimensional space ($m = 1$). Define the direction of this space using a $d$-dimensional vector $u_1$. The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean $\bar{x} = \frac{1}{N} \sum_{n=1}^N x_n$.

Maximize Variance of Projected Data (cont.)

The variance of the projected data is given by

$\frac{1}{N} \sum_{n=1}^N \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1$

where $S$ is the data covariance matrix defined by

$S = \frac{1}{N} \sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T$

Goal: maximize the projected variance $u_1^T S u_1$ with respect to $u_1$. To prevent $u_1$ from growing to infinity, use the constraint $u_1^T u_1 = 1$, which gives the optimization problem:

maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$

Maximize Variance of Projected Data (cont.)

Lagrangian form (one Lagrange multiplier $\lambda_1$):

$L(u_1, \lambda_1) = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)$

Setting the derivative with respect to $u_1$ to zero, $\frac{\partial L(u_1, \lambda_1)}{\partial u_1} = 0$, gives

$S u_1 = \lambda_1 u_1$

which says that $u_1$ must be an eigenvector of $S$. Finally, by left-multiplying by $u_1^T$ and making use of $u_1^T u_1 = 1$, one can see that the variance is given by $u_1^T S u_1 = \lambda_1$. Observe that the variance is maximized when $u_1$ equals the eigenvector having the largest eigenvalue $\lambda_1$.
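A small numerical illustration of this result in R; a sketch only, assuming toy 2-D data and using cov() with its 1/(N-1) normalization instead of the 1/N above (which only rescales S):

# Sketch: the top eigenvector of S maximizes the projected variance u^T S u.
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2) %*% matrix(c(2, 1, 0, 0.5), 2, 2)  # toy 2-D data
S <- cov(X)               # data covariance matrix (1/(N-1) scaling)
eig <- eigen(S)           # eigenvalues are returned in decreasing order
u1 <- eig$vectors[, 1]    # unit-length eigenvector for the largest eigenvalue

t(u1) %*% S %*% u1        # projected variance u_1^T S u_1 ...
eig$values[1]             # ... equals the largest eigenvalue lambda_1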

Second Principal Component

The second eigenvector $u_2$ should also be of unit length and orthogonal to $u_1$ (after projection, uncorrelated with $u_1^T x$):

maximize $u_2^T S u_2$ subject to $u_2^T u_2 = 1$, $u_2^T u_1 = 0$

Lagrangian form (two Lagrange multipliers $\lambda_1, \lambda_2$):

$L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_2 (u_2^T u_2 - 1) - \lambda_1 (u_2^T u_1 - 0)$

Setting the derivative to zero (the multiplier $\lambda_1$ on the orthogonality constraint turns out to be zero) gives $S u_2 = \lambda_2 u_2$ and hence $u_2^T S u_2 = \lambda_2$, which implies that $u_2$ should be the eigenvector of $S$ with the second largest eigenvalue $\lambda_2$. The remaining directions are given by the eigenvectors with decreasing eigenvalues.

PCA Example

(Figures: "First and second eigenvector" and "Projection on first eigenvector" show the 2-D data (data.xy) with its eigenvectors; "Projection on both orthogonal eigenvectors" shows the data in eigenvector coordinates (data.x.eig).)
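The plots can be reproduced along the following lines in R; a sketch continuing the toy data X from the earlier snippet, with all variable names being assumptions:

# Sketch: projections corresponding to the plots above.
Xc <- sweep(X, 2, colMeans(X))   # center the data
U  <- eigen(cov(Xc))$vectors     # columns: first and second eigenvector
Z  <- Xc %*% U                   # coordinates in the eigenvector basis (both eigenvectors)
Z1 <- Xc %*% U[, 1]              # projection onto the first eigenvector only
plot(Xc, asp = 1); arrows(0, 0, U[1, ] * 2, U[2, ] * 2, col = "red")  # data and eigenvectors
plot(Z,  asp = 1)                # data expressed in the eigenvector basis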

Proportion of Variance

In image and speech processing problems the inputs are usually highly correlated. If the dimensions are highly correlated, then there will be a small number of eigenvectors with large eigenvalues ($m \ll d$). As a result, a large reduction in dimensionality can be attained.

Proportion of variance explained:

$\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_m}{\lambda_1 + \lambda_2 + \ldots + \lambda_m + \ldots + \lambda_d}$

(Figure: proportion of variance explained vs. number of eigenvectors, digit class 1, USPS database.)
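In R the curve can be computed directly from the eigenvalues; a sketch using the toy data X from the earlier snippets rather than the USPS digit data from the slide:

# Sketch: cumulative proportion of variance explained by the first m eigenvectors.
lambda <- eigen(cov(X))$values           # eigenvalues in decreasing order
prop   <- cumsum(lambda) / sum(lambda)   # (lambda_1 + ... + lambda_m) / (lambda_1 + ... + lambda_d)
plot(prop, type = "b", xlab = "Eigenvectors", ylab = "Proportion of variance")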

PCA Second Example

Segment a 256 x 256 image into 32 x 32 = 1024 image pieces of size 8 x 8, flattened to vectors $x_1, x_2, \ldots, x_{1024} \in \mathbb{R}^{64}$.

- Determine the mean: $\bar{x} = \frac{1}{1024} \sum_{n=1}^{1024} x_n$
- Determine the covariance matrix $S$ and the $m$ eigenvectors $u_1, u_2, \ldots, u_m$ having the largest corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$
- Create the eigenvector matrix $U$, where $u_1, u_2, \ldots, u_m$ are column vectors
- Project the image pieces $x_i$ into the subspace as follows: $z_i = U^T (x_i - \bar{x})$

PCA Second Example (cont.)

Reconstruct the image pieces by back-projecting them into the original space as $\tilde{x}_i = U z_i + \bar{x}$. Note, the mean is added (subtracted in the step before) because the data is not normalized.

(Figures: the original image; proportion of variance explained vs. number of eigenvectors; reconstructions with 16, 32, 48, and 64 eigenvectors.)
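A sketch of the whole pipeline in R; a random matrix stands in for the actual 256 x 256 image, and the value of m, the loop structure, and all variable names are assumptions:

# Sketch: patch-based PCA compression and reconstruction of a 256 x 256 image.
img <- matrix(runif(256 * 256), 256, 256)       # placeholder for the real image

patches <- matrix(0, nrow = 1024, ncol = 64)    # 32 x 32 = 1024 patches of size 8 x 8
k <- 1
for (i in seq(1, 256, by = 8)) {
  for (j in seq(1, 256, by = 8)) {
    patches[k, ] <- as.vector(img[i:(i + 7), j:(j + 7)])
    k <- k + 1
  }
}

m    <- 16                                      # number of eigenvectors kept
xbar <- colMeans(patches)                       # mean patch
U    <- eigen(cov(patches))$vectors[, 1:m]      # eigenvector matrix (columns u_1, ..., u_m)

Z     <- sweep(patches, 2, xbar) %*% U          # z_i = U^T (x_i - xbar), one row per patch
recon <- Z %*% t(U) + matrix(xbar, 1024, 64, byrow = TRUE)  # x~_i = U z_i + xbar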

PCA with a Neural Network

(Figure: a single linear unit with output $V$, weights $w_1, w_2, \ldots, w_d$, and inputs $x_1, x_2, \ldots, x_d$.)

$V = w^T x = \sum_{j=1}^d w_j x_j$

Apply the Hebbian learning rule $\Delta w_i = \eta V x_i$, such that after some update steps the weight vector $w$ should point in the direction of maximum variance.

PCA with a Neural Network (cont.)

Suppose that there is a stable equilibrium point for $w$ such that the average weight change is zero:

$0 = \langle \Delta w_i \rangle = \langle V x_i \rangle = \left\langle \sum_j w_j x_j x_i \right\rangle = \sum_j C_{ij} w_j$, i.e. $0 = C w$

Angle brackets indicate an average over the input distribution $P(x)$, and $C$ denotes the correlation matrix with $C_{ij} \equiv \langle x_i x_j \rangle$, or $C \equiv \langle x x^T \rangle$.

Note, $C$ is symmetric ($C_{ij} = C_{ji}$) and positive semi-definite, which implies that its eigenvalues are positive or zero and its eigenvectors can be taken as orthogonal.

PCA with a Neural Network (cont.)

- At our hypothetical equilibrium point, $w$ is an eigenvector of $C$ with eigenvalue 0.
- This is never stable, because $C$ has some positive eigenvalues, and the corresponding eigenvector components of $w$ would grow exponentially.
- Constrain the growth of $w$, e.g. by renormalization ($\|w\| = 1$) after each update step.
- A more elegant idea: add a weight decay proportional to $V^2$ to the Hebbian learning rule (Oja's Rule):

$\Delta w_i = \eta V (x_i - V w_i)$
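A minimal sketch of Oja's rule in R on z-normalized toy data (the data, learning rate, and epoch count are assumptions); after training, w should approximate the largest eigenvector of the correlation matrix up to sign:

# Sketch: Oja's rule dw_i = eta * V * (x_i - V * w_i) on 2-D toy data.
set.seed(2)
X   <- scale(matrix(rnorm(500 * 2), ncol = 2) %*% matrix(c(2, 1, 0, 0.5), 2, 2))
eta <- 0.01
w   <- rnorm(2)
for (epoch in 1:100) {
  for (n in sample(nrow(X))) {      # present the samples in random order
    x <- X[n, ]
    V <- sum(w * x)                 # V = w^T x
    w <- w + eta * V * (x - V * w)  # Hebbian term plus weight decay proportional to V^2
  }
}
w                                   # ~ unit length; compare with ...
eigen(cor(X))$vectors[, 1]          # ... the largest eigenvector (up to sign)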

PCA with a Neural Network Example

(Figure: 2-D data (data.xy) with the weight vector learned by Oja's rule (blue) and the largest eigenvector (red).)

Some insights into Oja's Rule

Oja's rule converges to a weight vector $w$ with the following properties:

- unit length: $\|w\| = 1$
- eigenvector direction: $w$ lies in a maximal eigenvector direction of $C$
- variance maximization: $w$ lies in a direction that maximizes $\langle V^2 \rangle$

Oja's learning rule is still limited, because we can construct only the first principal component of the z-normalized data.

Construct the first m principal components

Single-layer network with the $i$-th output $V_i$ given by $V_i = \sum_j w_{ij} x_j = w_i^T x$, where $w_i$ is the weight vector for the $i$-th output.

Oja's m-unit learning rule:

$\Delta w_{ij} = \eta V_i \left( x_j - \sum_{k=1}^{m} V_k w_{kj} \right)$

Sanger's learning rule:

$\Delta w_{ij} = \eta V_i \left( x_j - \sum_{k=1}^{i} V_k w_{kj} \right)$

Both rules reduce to Oja's 1-unit rule for the $m = 1$ and $i = 1$ case.

Oja's and Sanger's Rule

- In both cases the $w_i$ vectors converge to orthogonal unit vectors.
- In Sanger's rule the weight vectors become exactly the first $m$ principal components, in order: $w_i = \pm c_i$, where $c_i$ is the normalized eigenvector of the correlation matrix $C$ belonging to the $i$-th largest eigenvalue $\lambda_i$.
- Oja's m-unit rule converges to $m$ weight vectors that span the same subspace as the first $m$ eigenvectors, but does not find the eigenvector directions themselves.
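A sketch of Sanger's rule in R, written in matrix form (the toy 3-D data, learning rate, and epoch count are assumptions); the rows of W should approach the first m eigenvectors of the correlation matrix, up to sign and small stochastic error:

# Sketch: Sanger's rule dW = eta * (V x^T - LT[V V^T] W), LT = lower-triangular part.
set.seed(3)
A   <- matrix(c(2, 1, 0,  0, 1, 1,  0, 0, 0.5), 3, 3)
X   <- scale(matrix(rnorm(1000 * 3), ncol = 3) %*% A)   # z-normalized, correlated 3-D data
m   <- 2
eta <- 0.005
W   <- matrix(rnorm(m * ncol(X)), nrow = m) * 0.1       # one weight vector w_i per row
for (epoch in 1:200) {
  for (n in sample(nrow(X))) {
    x  <- X[n, ]
    V  <- as.vector(W %*% x)                            # V_i = w_i^T x
    LT <- lower.tri(diag(m), diag = TRUE) * (V %o% V)   # keeps only the k <= i terms
    W  <- W + eta * (V %o% x - LT %*% W)                # dw_ij = eta V_i (x_j - sum_{k<=i} V_k w_kj)
  }
}
W                                    # rows ~ +/- first m eigenvectors of C
t(eigen(cor(X))$vectors[, 1:m])      # compare (up to sign)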

Linear Auto-Associative Network

(Figure: network with original features $x_1, \ldots, x_d$, a bottleneck of $m$ extracted features $z_1, \ldots, z_m$ (extraction), and reconstructed features $\tilde{x}_1, \ldots, \tilde{x}_d$ (reconstruction).)

The network is trained to perform the identity mapping. Idea: the bottleneck units represent significant features of the input data.

Train the network by minimizing the sum-of-squares error

$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{d} \left( y_k(x^{(n)}) - x_k^{(n)} \right)^2$

Linear Auto-Associative Network (cont.)

- As with Oja's/Sanger's update rules, this type of learning can be considered unsupervised learning, since no independent target data is provided.
- The error function has a unique global minimum when the hidden units have linear activation functions.
- At this minimum the network performs a projection onto the m-dimensional sub-space which is spanned by the first m principal components of the data.
- Note, however, that these vectors need not be orthogonal or normalized.
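A sketch of such a linear bottleneck network in R, trained by plain full-batch gradient descent on the sum-of-squares error (the slides do not specify the optimizer; the data, learning rate, iteration count, and initialization are all assumptions):

# Sketch: linear auto-associative network x -> z = W1 x -> y = W2 z, trained by
# gradient descent on E = 1/2 sum_n ||y(x_n) - x_n||^2.
set.seed(4)
A  <- matrix(c(2, 1, 0,  0, 1, 1,  0, 0, 0.5), 3, 3)
X  <- scale(matrix(rnorm(500 * 3), ncol = 3) %*% A, scale = FALSE)  # centered data, N x d
d  <- ncol(X); m <- 2; eta <- 0.05; N <- nrow(X)
W1 <- matrix(rnorm(m * d), m, d) * 0.1      # extraction weights     (m x d)
W2 <- matrix(rnorm(d * m), d, m) * 0.1      # reconstruction weights (d x m)
for (step in 1:5000) {
  Z <- X %*% t(W1)                          # extracted features, N x m
  Y <- Z %*% t(W2)                          # reconstructions,    N x d
  E <- Y - X                                # residuals
  g2 <- t(E) %*% Z / N                      # gradient of the error w.r.t. W2 (averaged)
  g1 <- t(W2) %*% t(E) %*% X / N            # gradient of the error w.r.t. W1 (averaged)
  W2 <- W2 - eta * g2
  W1 <- W1 - eta * g1
}
# At the minimum, W2 %*% W1 is the projection onto the span of the first m
# principal components, even though W1's rows need not be orthonormal.
range(W2 %*% W1 - tcrossprod(eigen(cov(X))$vectors[, 1:m]))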

Non-Linear Auto-Associative Network

(Figure: network with original features $x_1, \ldots, x_d$, a non-linear hidden layer, a linear bottleneck of $m$ extracted features $z_1, \ldots, z_m$, a second non-linear hidden layer, and a linear output layer of reconstructed features $\tilde{x}_1, \ldots, \tilde{x}_d$.)