Principal Components Analysis. Sargur Srihari, University at Buffalo


Topics: projection pursuit methods, principal components, examples of using PCA, graphical use of PCA, and multidimensional scaling.

Motivation: scatterplots are good for looking at two variables at a time, but they may miss more complicated relationships. PCA is a method for transforming the data into new variables: projections along different directions, chosen to detect such relationships, say along the direction defined by 2x_1 + 3x_2 + x_3 = 0.

Projection pursuit methods allow searching for interesting directions, where "interesting" means maximum variability. Figure: data in 2-d space (axes x_1, x_2) projected to 1-d onto the line 2x_1 + 3x_2 = 0. The task is to find such a projection direction a.

Principal components: find linear combinations of the variables that maximize variance, subject to being uncorrelated with those already selected. Hopefully only a few such linear combinations, known as principal components, are needed. The task is to find a k-dimensional projection, where 0 < k < d - 1.

Data matrix definition: X is an n x d data matrix of n cases on d variables. Row i of the matrix is x(i)^T, where x(i) is a d x 1 column vector. Assume X is mean-centered, so that the mean of each variable has been subtracted from that variable's values.
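For concreteness, a minimal sketch of mean-centering (the small data matrix below is made up for the example):

```python
import numpy as np

# Hypothetical data matrix: n = 5 cases, d = 3 variables.
X = np.array([[2.0, 4.0, 1.0],
              [3.0, 6.0, 0.0],
              [4.0, 5.0, 2.0],
              [5.0, 7.0, 1.0],
              [6.0, 8.0, 3.0]])

# Mean-center: subtract each variable's (column's) mean.
Xc = X - X.mean(axis=0)
print(Xc.mean(axis=0))   # each column mean is now (numerically) zero
```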

Projection definition: let a be a d x 1 column vector of projection weights that results in the largest variance when the data X are projected along a. The projection of a data vector x = (x_1, ..., x_d)^T onto a = (a_1, ..., a_d)^T is the linear combination a^T x = a_1 x_1 + ... + a_d x_d. The projected values of all data vectors in X onto a are given by Xa, an n x 1 column vector: a set of n scalar values, one per projected point. (Since X is n x d and a is d x 1, Xa is n x 1.)

Variance along a projection: the variance along a is sigma_a^2 = (Xa)^T (Xa) = a^T X^T X a = a^T V a, where V = X^T X is the d x d covariance matrix of the data (since X has zero mean). Thus the variance is a function of both the projection direction a and the covariance matrix V.
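A quick numerical check of this identity (hypothetical data; note the slides take V = X^T X, whereas the sample covariance would divide by n or n - 1, which does not affect the argument):

```python
import numpy as np

rng = np.random.default_rng(0)
Xc = rng.normal(size=(100, 3))          # hypothetical data, n = 100, d = 3
Xc -= Xc.mean(axis=0)                   # mean-center

a = np.array([2.0, 3.0, 1.0])
a /= np.linalg.norm(a)                  # normalize so that a^T a = 1

V = Xc.T @ Xc                           # V = X^T X, as in the slides
proj = Xc @ a                           # projected values, an n-vector

# Variance along a, computed two equivalent ways.
print(proj @ proj)                      # (Xa)^T (Xa)
print(a @ V @ a)                        # a^T V a -- the same value
```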

Maximization of variance: maximizing the variance along a is not well-defined, since we can increase it without limit by increasing the size of the components of a. So impose a normalization constraint on the a vectors such that a^T a = 1. The optimization problem is then to maximize u = a^T V a - lambda (a^T a - 1), where lambda is a Lagrange multiplier. Differentiating with respect to a yields du/da = 2Va - 2 lambda a = 0, which reduces to (V - lambda I) a = 0, the characteristic equation!

What is the characteristic equation? Given a d x d matrix V, a very important class of linear equations has the form V x = lambda x (V is d x d, x is d x 1), which can be rewritten as (V - lambda I) x = 0. If V is real and symmetric there are d possible solution vectors, called eigenvectors, e_1, ..., e_d, with associated eigenvalues.

The principal components are obtained from the covariance matrix: if V is the covariance matrix, then its characteristic equation is (V - lambda I) a = 0. The roots are the eigenvalues, and the corresponding eigenvectors are the principal components. The first principal component is the eigenvector associated with the largest eigenvalue of V.
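A minimal sketch, on made-up data, of obtaining the principal components as the eigenvectors of V ordered by eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical mean-centered data with one dominant direction of variation.
Xc = rng.normal(size=(200, 3)) @ np.diag([5.0, 2.0, 0.5])
Xc -= Xc.mean(axis=0)

V = Xc.T @ Xc                                  # covariance matrix (slides' convention)
eigvals, eigvecs = np.linalg.eigh(V)           # eigh: V is symmetric

# eigh returns eigenvalues in ascending order; reorder to descending.
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

e1 = eigvecs[:, 0]                             # first principal component
print(eigvals)                                 # largest eigenvalue first
print(e1)
```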

Other principal components: the second principal component lies in the direction orthogonal to the first and has the second-largest eigenvalue, and so on. Figure: data in the (X_1, X_2) plane with the first principal component e_1 and the second principal component e_2 drawn in.

Projection onto k eigenvectors: the variance of the data projected onto the first k eigenvectors e_1, ..., e_k is the sum of the first k eigenvalues, lambda_1 + ... + lambda_k. The squared error in approximating the true data matrix X using only the first k eigenvectors is the ratio (lambda_{k+1} + ... + lambda_d) / (lambda_1 + ... + lambda_d). How to choose k? Increase k until the squared error is less than a threshold; usually 5-10 principal components capture 90% of the variance in the data.
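A sketch of this rule for choosing k, treating the eigenvalue figures quoted for the CPU data on the next slide as illustrative values (whether they are raw eigenvalues or percentages does not matter, since only ratios are used):

```python
import numpy as np

# Eigenvalue figures from the CPU-data example slide, in descending order.
eigvals = np.array([63.26, 10.70, 10.30, 6.68, 5.23, 2.18, 1.31, 0.34])

frac = np.cumsum(eigvals) / eigvals.sum()   # fraction of variance in first k components
k = int(np.argmax(frac >= 0.90)) + 1        # smallest k capturing 90% of the variance
print(frac)
print(k)
```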

Example of PCA: CPU data. The eight eigenvalues of the correlation matrix are 63.26, 10.70, 10.30, 6.68, 5.23, 2.18, 1.31, 0.34. Figures: a scatterplot matrix of the data, and a scree plot (percent variance explained against eigenvalue number) showing the amount of variance explained by each consecutive eigenvalue. An example eigenvector: the weights put by the first component e_1 on the eight variables are 0.199, -0.365, -0.399, -0.336, -0.331, -0.298, -0.421, -0.423.

PCA using the correlation matrix versus the covariance matrix. Figures: scree plots (percent variance explained against eigenvalue number) for the correlation matrix and for the covariance matrix. Proportions of variation attributable to the different components: 96.02, 3.93, 0.04, 0.01, 0, 0, 0, 0.

Graphical use of PCA: projection onto the first two principal components of six-dimensional data. There are 17 pills (data points); the six values for each pill are the times at which a specified proportion of the pill has dissolved: 10%, 30%, 50%, 70%, 75%, 90%. Figure: the data plotted against principal components 1 and 2, showing that pill 3 is very different.

Computational issue, scaling with dimensionality: the cost is O(nd^2 + d^3): O(nd^2) to calculate V and O(d^3) to solve the eigenvalue equations for the d x d matrix. The method can be applied to large numbers of records n, but it does not scale well with the dimensionality d. Also, appropriate scalings of the variables have to be chosen.

Multidimensional scaling: using PCA to project onto a plane is effective only if the data lie in a 2-d subspace. Intrinsic dimensionality: the data may instead lie on a string or surface in d-space. For example, when a digit image is translated and rotated, the images in pixel space lie on a 3-dimensional manifold (defined by location and orientation).

Goal of multidimensional scaling: detecting underlying structure by representing the data in a lower-dimensional space so that distances are preserved. Distances between data points are mapped to a reduced space, typically displayed on a 2-d plot; we begin with the distances and then compute the plot. Examples arise in psychometrics and market research, where similarities between objects are given by subjects.

Defining the B matrix: for an n x d data matrix X we can compute the n x n matrix B = XX^T. We will see (next slide) that the squared Euclidean distance between the i-th and j-th objects is given by d_ij^2 = b_ii + b_jj - 2 b_ij. Both XX^T and X^T X are meaningful matrices.

X^T X versus XX^T: if X is n x d (here d = 4), then X^T X is d x d (a d x n matrix times an n x d matrix), which is the covariance matrix, while B = XX^T is n x n (an n x d matrix times a d x n matrix). Since d_ij^2 = b_ii + b_jj - 2 b_ij, the B matrix contains the distance information.
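A quick numerical check of this identity (a sketch with a hypothetical data matrix):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(6, 4))              # hypothetical data, n = 6, d = 4

B = X @ X.T                              # B = X X^T, an n x n matrix

i, j = 1, 4
d2_from_B = B[i, i] + B[j, j] - 2 * B[i, j]
d2_direct = np.sum((X[i] - X[j]) ** 2)   # squared Euclidean distance
print(d2_from_B, d2_direct)              # the two agree
```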

Factorizing the B matrix: given a matrix of distances D, derived from the original data by computing the n(n-1)/2 pairwise distances, compute the elements of B by inverting d_ij^2 = b_ii + b_jj - 2 b_ij, then factorize B in terms of its eigenvectors to yield coordinates for the points. The two largest eigenvalues give a 2-d representation.

Inverting distances to get B: starting from d_ij^2 = b_ii + b_jj - 2 b_ij, summing over i we can obtain b_jj, summing over j we can obtain b_ii, and summing over both i and j we can obtain tr(B). This expresses b_ij as a function of the d_ij^2. The method is known as the principal coordinates method.
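A minimal sketch of the principal coordinates method on hypothetical data, using the standard double-centering formula B = -1/2 J D^2 J (J the centering matrix), which carries out the same inversion as the summations described above:

```python
import numpy as np

def principal_coordinates(D, k=2):
    """Recover k-dimensional coordinates from an n x n matrix of
    Euclidean distances D (a sketch of the principal coordinates method)."""
    n = D.shape[0]
    D2 = D ** 2
    J = np.eye(n) - np.ones((n, n)) / n      # centering matrix
    B = -0.5 * J @ D2 @ J                    # double-center the squared distances
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]        # largest eigenvalues first
    eigvals, eigvecs = eigvals[order[:k]], eigvecs[:, order[:k]]
    return eigvecs * np.sqrt(np.maximum(eigvals, 0))

# Check on synthetic 2-d points: distances are reproduced up to rotation/reflection.
rng = np.random.default_rng(3)
pts = rng.normal(size=(10, 2))
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
coords = principal_coordinates(D, k=2)
D_rec = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.allclose(D, D_rec))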

Criterion for multidimensional scaling: find a projection into two dimensions that minimizes a sum-of-squares discrepancy between the observed distance between points i and j in d-space and the distance between the corresponding points in two-dimensional space. This criterion is invariant with respect to rotations and translations; however, it is not invariant to scaling. A better, normalized criterion is called the stress.
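The formulas on this slide did not survive extraction. For reference, the standard forms of the least-squares criterion and of the stress, which may differ in detail from the original slide (here delta_ij is the observed distance in d-space and d_ij the distance in the two-dimensional representation), are:

```latex
% Hedged reconstruction of the standard MDS criteria (the slide's own formulas were lost).
\sum_{i<j} \left(\delta_{ij} - d_{ij}\right)^{2}
\qquad\text{and}\qquad
\text{stress} \;=\;
\left( \frac{\sum_{i<j} \left(\delta_{ij} - d_{ij}\right)^{2}}{\sum_{i<j} d_{ij}^{2}} \right)^{1/2}.
```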

Algorithm for multidimensional scaling: a two-stage procedure. Assume that d_ij = a + b delta_ij + e_ij, where the delta_ij are the original dissimilarities. First, regress the 2-d distances on the given dissimilarities, yielding estimates for a and b; then find new values of d_ij (new 2-d coordinates) that minimize the stress. Repeat until convergence.
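As a practical alternative to coding this loop by hand, scikit-learn's MDS estimator minimizes stress iteratively (it uses the SMACOF algorithm rather than the regression-based procedure above). A minimal sketch, assuming scikit-learn is installed and D is a precomputed dissimilarity matrix:

```python
import numpy as np
from sklearn.manifold import MDS

rng = np.random.default_rng(4)
pts = rng.normal(size=(20, 6))                      # hypothetical 6-dimensional data
D = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

# Metric MDS: iteratively moves 2-d points to reduce the stress.
mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)                       # n x 2 coordinates for plotting
print(coords.shape, mds.stress_)
```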

Multidimensional scaling plot, dialect similarities: the plot shows numerical codes of villages and their counties, where each pair of villages is rated by the percentage of 60 items for which the villagers used different words. We are able to visualize 625 distances intuitively.

Variations of multidimensional scaling: the methods above are called metric methods. Sometimes precise similarities are not known, only rank orderings, and we may not be able to assume a particular form of relationship between d_ij and delta_ij. This requires a two-stage approach in which the simple linear regression is replaced with monotonic regression.

Multidimensional scaling, disadvantages: when there are too many data points, the structure becomes obscured. The transformations of the data are highly sophisticated (compared to scatterplots and PCA), with the possibility of introducing artifacts. Dissimilarities can be determined more accurately when objects are similar than when they are very dissimilar, which produces the horseshoe effect, e.g., when objects manufactured within a short time span of each other can be compared accurately while objects separated by a greater time gap all appear roughly equally different. Biplots show both data points and variables.