
Statistics-and-PCA
September 7, 2017

In [1]: using PyPlot

1 Mean and variance

Suppose we have a black box (a distribution) that generates data points xₖ (samples), k = 1, 2, 3, …. If we have n data points, the sample mean m is simply the average:

    m = (1/n) ∑_{k=1}^n xₖ

In the limit n → ∞, we get the mean µ of the underlying distribution from which the samples are generated.

The sample variance S² is the mean-square deviation from the mean:

    Var(x) = S² = (1/(n-1)) ∑_{k=1}^n (xₖ - m)²

where the n-1 in the denominator is Bessel's correction. The limit n → ∞ of the sample variance gives σ², the variance of the underlying distribution, and by using n-1 instead of n in the denominator it turns out that we get a better estimate of σ² when n is not huge.

For example, the randn() function in Julia draws samples from a normal distribution: a Gaussian or "bell curve" with mean zero and variance 1:

In [2]: x = randn(100)  # 100 Gaussian random numbers
        plot(x, zeros(x), "r.")
        grid()

It is more informative to plot a histogram:

In [3]: x = randn(10000)
        plt[:hist](x, bins=100);

The mean is the center of this peak, and the square root S of the variance is a measure of the width of this peak.

The mean of those 10000 samples is a pretty good estimate for the true mean (= 0) of the underlying normal distribution:

In [4]: mean(x)

Out[4]: -0.011081208688453203

The sample variance is:

In [5]: sum((x - mean(x)).^2)/(length(x)-1)

Out[5]: 1.0040402892748792

Or (equivalently, but more efficient) the built-in function var:

In [6]: var(x)

Out[6]: 1.0040402892748792

which is a pretty good estimate for the true variance (= 1). If we looked at more points, we would get better estimates:

In [7]: xbig = randn(10^7)
        mean(xbig), var(xbig)

Out[7]: (6.320131515417326e-5, 0.9991775483644516)
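As a quick aside (not part of the original notebook), here is a hedged sketch of why the n-1 denominator helps: averaging both estimators over many small samples, the 1/(n-1) version comes out close to the true variance 1, while the 1/n version is biased low. The sample size n = 5 and the trial count are arbitrary choices for illustration.

    # sketch: compare the 1/n and 1/(n-1) variance estimators on many small samples
    n = 5                                    # deliberately small samples
    trials = [randn(n) for _ in 1:10^5]      # many independent samples of size n
    biased   = mean([sum((t - mean(t)).^2)/n     for t in trials])   # 1/n estimator
    unbiased = mean([sum((t - mean(t)).^2)/(n-1) for t in trials])   # Bessel-corrected estimator
    biased, unbiased   # ≈ 0.8 and ≈ 1.0: only the second estimates σ² = 1 without bias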

1.1 Mean and variance in linear algebra

If we define the vector o = (1, 1, …, 1) to be the vector of n 1's, with oᵀo = n, then the mean of x is:

    m = oᵀx / oᵀo

which is simply the projection of x onto o. And the sample variance

    Var(x) = ‖x - m·o‖²/(n-1) = ‖(I - ooᵀ/oᵀo) x‖²/(n-1)

is the length² of the projection of x perpendicular (⊥) to o, divided by n-1. In fact, the n-1 denominator is closely related to the fact that this orthogonal projection lives in an (n-1)-dimensional space: after you subtract off the mean, there are only n-1 degrees of freedom left. (This is just a handwaving argument; for more careful derivations of this n-1 denominator, see e.g. "Bessel's correction" on Wikipedia.)
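Not in the original notebook, but as a hedged sanity check of these two formulas, we can compare them against the built-in mean and var for the x defined above (without forming the 10000×10000 matrix I - ooᵀ/oᵀo explicitly); the names o, m, and Px are ours:

    o = ones(length(x))                      # the vector o of n 1's
    m = dot(o, x) / dot(o, o)                # oᵀx / oᵀo
    Px = x - m*o                             # (I - ooᵀ/oᵀo) x: x with its mean subtracted
    (m, mean(x)), (sum(Px.^2)/(length(x)-1), var(x))   # each pair should agree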

2 Covariance and Correlation

A key question in statistics is whether/how two sets of data are correlated. If you have two variables x and y, do they tend to move together? An intuitive measure for this is: when x is greater/less than its mean, is y also greater/less than its mean? Translated into math, this leads to the covariance:

    Covar(x, y) = (1/(n-1)) ∑_{k=1}^n (xₖ - mean(x))(yₖ - mean(y)) = (Px)ᵀ(Py)/(n-1) = xᵀPy/(n-1)

where P = I - ooᵀ/oᵀo is the projection operator from above that subtracts the mean from a vector (i.e. it projects vectors onto the subspace of vectors with zero mean). (In the last step we used the facts that Pᵀ = P and P² = P.)

For example, here are plots of two very correlated vectors x and y of data, and a third data set z that is just independent random numbers:

In [8]: x = sin(linspace(0,4π,100) + randn(100)*0.1) .* (1 + 0.3*randn(100))
        y = sin(linspace(0,4π,100) + randn(100)*0.1) .* (1 + 0.3*randn(100))
        z = randn(100)
        plot(x, "b.-")
        plot(y, "r.-")
        plot(z, "g.-")
        legend(["x","y","z"])

Out[8]: PyObject <matplotlib.legend.Legend object at 0x3290bc690>

All three have mean nearly zero:

In [9]: mean(x), mean(y), mean(z)

Out[9]: (-0.020601421221630608, -0.0011893930679909397, -0.06415236608580241)

But the covariance of x and y is totally different from the covariance of x and z:

In [10]: # A simple covariance function. See https://github.com/JuliaStats/StatsBase.jl for
         # better statistical functions in Julia.
         covar(x,y) = dot(x - mean(x), y - mean(y)) / (length(x) - 1)

Out[10]: covar (generic function with 1 method)

In [11]: covar(x,y), covar(x,z)

Out[11]: (0.4940689175964578, -0.07312995881233886)

The variance and covariance have the units of the data squared. I can make the covariance of x and y smaller simply by dividing y by 10, which doesn't seem like a good measure of how correlated they are. Often, it is nicer to work with a dimensionless quantity, independent of the vector lengths: the correlation

    Cor(x, y) = Covar(x, y) / √(Var(x) · Var(y)) = (Px)ᵀ(Py) / (‖Px‖ · ‖Py‖)

This is just the dot product of the vectors (after subtracting their means) divided by their lengths. It turns out that Julia has a built-in function cor that computes precisely this:

In [12]: covar(x,y) / sqrt(covar(x,x) * covar(y,y))  # correlation, manually computed

Out[12]: 0.935143359379067

In [13]: cor(x,y)

Out[13]: 0.9351433593790669

In [14]: cor(x,z)

Out[14]: -0.08605379904360413

In [15]: abs(cor(x,y)/cor(x,z))

Out[15]: 10.866961944413662

Now that we've scaled out the overall length of the vectors, we can sensibly compare the correlation of x,y with the correlation of x,z, and we see that the former is more than 10× the latter in this sample.

3 The covariance and correlation matrices

If we have a bunch of data sets, we might want the covariance or correlation of every pair of data sets. Since these are basically dot products, asking for all of the dot products is the same as asking for a matrix multiplication. In particular, suppose that X is the m × n matrix whose rows are m different datasets of length n. First, we need to subtract off the means of each row to form a new matrix A:

    A = XP = XPᵀ = (PXᵀ)ᵀ

where P is the projection matrix from above that subtracts the mean. Multiplying by Pᵀ = P on the right corresponds to projecting each row of X.

Given A, we can compute all of the covariances simply by computing the covariance matrix

    S = AAᵀ/(n-1)

since AAᵀ computes all of the dot products of all of the rows of A. The diagonal entries of S are the variances of each dataset, and the off-diagonal elements are the covariances.

Alternatively, we can compute the correlation matrix C = ÂÂᵀ, where Â is simply the matrix A scaled so that each row has unit length, i.e. Â = DA, where D is a diagonal matrix whose entries are the inverses of the lengths of each row (the inverse square roots of the diagonal entries of AAᵀ).

Let's look in more detail at the two correlated vectors x and y from above:

In [16]: plot(x,y, ".")
         xlabel("x")
         ylabel("y")
         axhline(0, linestyle="--", color="k", alpha=0.2)
         axvline(0, linestyle="--", color="k", alpha=0.2)

Out[16]: PyObject <matplotlib.lines.Line2D object at 0x329302910>

The covariance matrix is:

In [17]: A = [(x - mean(x))'; (y - mean(y))']  # rows are x and y with means subtracted

Out[17]: 2×100 Array{Float64,2}:
 -0.0136799   0.194767    0.153322   …  -0.255841  -0.131478   0.00542467
 -0.0379629  -0.0412615   0.0632018  …  -0.169123  -0.0637059  0.0591737

In [18]: S = A * A' / (length(x)-1)

Out[18]: 2×2 Array{Float64,2}:
 0.579931  0.494069
 0.494069  0.481329

In this case, since there are only two datasets, S is just a 2×2 matrix.
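The notebook does not go on to compute the correlation matrix C = ÂÂᵀ described above, but here is a hedged sketch of how it could be done for this A (the names d, Â, and C are ours, not the notebook's):

    d = [norm(A[i,:]) for i in 1:2]   # lengths of the two rows of A
    Â = diagm(1 ./ d) * A             # Â = D*A: each row rescaled to unit length
    C = Â * Â'                        # correlation matrix: 1's on the diagonal
    C[1,2], cor(x,y)                  # off-diagonal entry should match cor(x,y) ≈ 0.935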

4 PCA: diagonalizing the covariance matrix

A key question in analyzing data is to figure out which variables are responsible for most of the variation in the data. These may not be the variables you measured, but may instead be some linear combination of the measured variables! Mathematically, this corresponds to diagonalizing the covariance (or correlation) matrix:

S is real-symmetric and positive-definite (or at least semidefinite), so the diagonalization S = QΛQᵀ finds real, non-negative eigenvalues λₖ = σₖ² ≥ 0 and an orthonormal basis Q of eigenvectors. The eigenvectors form a coordinate system in which the covariance matrix S becomes diagonal, i.e. a coordinate system in which the variables are uncorrelated. The diagonal entries in this coordinate system, the eigenvalues, are the variances of these uncorrelated components.

This process of diagonalizing the covariance matrix is called principal component analysis, or PCA. Let's try it:

In [19]: σ², Q = eig(S)

Out[19]: ([0.0341076, 1.02715], [0.671084 -0.741381; -0.741381 -0.671084])

We see that the second eigenvector is responsible for almost all of the variation in the data, because its eigenvalue (the variance σ²) is much larger:

In [20]: σ²

Out[20]: 2-element Array{Float64,1}:
 0.0341076
 1.02715

Let's plot these two eigenvectors on top of our data:

In [21]: plot(x - mean(x), y - mean(y), ".")
         xlabel("x")
         ylabel("y")
         axhline(0, linestyle="--", color="k", alpha=0.2)
         axvline(0, linestyle="--", color="k", alpha=0.2)
         arrow(0,0, Q[:,1]..., head_width=0.1, color="r")
         arrow(0,0, Q[:,2]..., head_width=0.1, color="r")
         text(Q[1,1]+0.1, Q[2,1], "q₁", color="r")
         text(Q[1,2]-0.2, Q[2,2], "q₂", color="r")
         axis("equal")

Out[21]: (-1.3857420714936417, 1.5314737916078676, -1.371642469431078, 1.3080133686711006)

The q₂ direction, corresponding to the biggest eigenvalue of the covariance matrix, is indeed the direction with the biggest variation in the data! In this case, it is along the (1,1) direction because x and y tend to move together. The other direction q₁ is the other uncorrelated (= orthogonal) direction of variation in the data. Not much is going on in that direction. (In fact, this q₂ can be viewed as a kind of best-fit line, which the Strang book calls "perpendicular least squares" and is also called Deming regression.)

5 PCA and the SVD

Instead of forming AAᵀ and diagonalizing that, we can equivalently (in the absence of roundoff errors) use the singular value decomposition (SVD) of A/√(n-1). Recall the SVD of this rescaled matrix:

    A/√(n-1) = UΣVᵀ

where U and V are orthogonal matrices and Σ is a diagonal m × n matrix of the singular values σₖ. Then, if we compute the covariance matrix, we get:

    S = AAᵀ/(n-1) = (UΣVᵀ)(UΣVᵀ)ᵀ = UΣVᵀVΣᵀUᵀ = UΣΣᵀUᵀ

Since ΣΣᵀ is a diagonal matrix of the squares σₖ² of the singular values σₖ, we find:

 • The squares σₖ² of the singular values are the variances of the uncorrelated components of the data (the eigenvalues of S).
 • The left singular vectors U are precisely the orthonormal eigenvectors of AAᵀ, i.e. the uncorrelated components of the data.

In practice, PCA typically uses the SVD directly rather than explicitly forming the covariance matrix S. (It turns out that computing AAᵀ explicitly exacerbates sensitivity to rounding errors and other errors in A.)

In [22]: U, σ, V = svd(A / sqrt(length(x)-1))

Out[22]: (
 [-0.741381 -0.671084; -0.671084 0.741381],
 [1.01349, 0.184682],
 [0.00353214 -0.0103205; -0.0115734 -0.0877767; … ; 0.0139059 0.0223135; -0.00433678 0.021893])

In [23]: σ.^2

Out[23]: 2-element Array{Float64,1}:
 1.02715
 0.0341076

As promised, this is the same as the eigenvalues of S from above. Conveniently, the convention for the SVD is to sort the singular values in descending order σ₁ ≥ σ₂ ≥ ⋯. So, the first singular value/vector represents most of the variation in the data, and so on.

In [24]: U

Out[24]: 2×2 Array{Float64,2}:
 -0.741381  -0.671084
 -0.671084   0.741381

In [25]: plot(x - mean(x), y - mean(y), ".")
         xlabel("x")
         ylabel("y")
         axhline(0, linestyle="--", color="k", alpha=0.2)
         axvline(0, linestyle="--", color="k", alpha=0.2)
         arrow(0,0, U[:,1]..., head_width=0.1, color="r")
         arrow(0,0, U[:,2]..., head_width=0.1, color="r")
         text(U[1,1]+0.1, U[2,1], "u₁", color="r")
         text(U[1,2]+0.2, U[2,2], "u₂", color="r")
         axis("equal")

Out[25]: (-1.3857420714936417, 1.5314737916078676, -1.371642469431078, 1.3080133686711006)

There is an irrelevant sign flip from before (the signs of the eigenvectors and singular vectors are arbitrary), but otherwise it is the same.
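As a final hedged check (not part of the original notebook), we can confirm that the singular vectors and the eigenvectors from Section 4 agree up to ordering and sign by looking at the magnitudes of their dot products:

    abs.(U' * Q)   # ≈ [0 1; 1 0]: each column of U matches a column of Q up to sign

The off-diagonal 1's reflect the reversed ordering: eig returned the eigenvalues in ascending order, while svd sorts the singular values in descending order.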