Dimensionality Reduction and Principal Components

Dimensionality Reduction and Principal Components Nuno Vasconcelos (Ken Kreutz-Delgado) UCSD

Motivation Recall, in Bayesian decision theory we have: a world with states Y in {1,..., M} and observations of X, class-conditional densities $P_{X|Y}(x|y)$, class probabilities $P_Y(i)$, and the Bayes decision rule (BDR). We have seen that this procedure is truly optimal only if all probabilities involved are correctly estimated. One of the most problematic factors in accurately estimating probabilities is the dimension of the feature space.

Example Cheetah Gaussian classifier, DCT space. With the first 8 DCT features the probability of error is 4%; with all 64 DCT features it is 8%. Interesting observation: more features = higher error!

Comments on the Example The first reason why this happens is that things are not what we think they are in high dimensions; one could say that high-dimensional spaces are STRANGE!!! In practice, we invariably have to do some form of dimensionality reduction, and eigenvalues play a major role in this. One of the major dimensionality reduction techniques is Principal Component Analysis (PCA).

The Curse of Dimensionality Typical observation in Bayes decision theory: error increases when the number of features is large. This is unintuitive since, theoretically, if I have a problem in n-D I can always generate a problem in (n+1)-D without increasing the probability of error, and often even decreasing it. E.g. two uniform classes A and B in 1D can be transformed into a 2D problem with the same error: just add a non-informative variable y (an extra feature dimension).

Curse of Dimensionality On the left, even with the new feature (dimension) y, there is no decision boundary that will achieve zero error. On the right, the addition of the new feature (dimension) y allows a decision boundary that has zero error.

Curse of Dimensionality So why do we observe this curse of dimensionality? The problem is the quality of the density estimates: BDR optimality assumes perfect estimation of the PDFs. This is not easy: most densities are not simple (Gaussian, exponential, etc.) but a mixture of several factors; there are many unknowns (# of components, what type); the likelihood has multiple local maxima; etc. Even with algorithms like EM, it is difficult to get this right.

Curse of Dimensionality The problem goes much deeper than this: even for simple models (e.g. Gaussian) we need a large number of examples n to have good estimates. Q: what does large mean? This depends on the dimension of the space. The best way to see this is to think of a histogram: suppose you have 100 points and you need at least 10 bins per axis in order to get a reasonable quantization. For uniform data you get, on average, 10 points/bin in 1D, 1 point/bin in 2D, and 0.1 points/bin in 3D, which is decent in 1D, bad in 2D, and terrible in 3D (at least 9 out of each 10 bins are empty!).
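
To make the bin arithmetic above concrete, here is a minimal sketch (assuming 100 points and 10 bins per axis, as on the slide) that computes the average number of points per bin as the dimension grows:

```python
# Average points per histogram bin vs. dimension, assuming 100 uniformly
# distributed points and 10 bins per axis (the numbers used on the slide).
n_points = 100
bins_per_axis = 10

for dim in (1, 2, 3):
    n_bins = bins_per_axis ** dim   # total number of bins in `dim` dimensions
    print(f"dim={dim}: {n_bins} bins, {n_points / n_bins:.1f} points/bin on average")
```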

Curse of Dimensionality This is the curse of dimensionality: for a given classifier, the number of examples required to maintain classification accuracy increases exponentially with the dimension of the feature space. In higher dimensions the classifier has more parameters, and therefore higher complexity, and is harder to learn.

Dimensionality Reduction What do we do about this? Avoid unnecessary dimensions. Unnecessary features arise in two ways: 1. features are not discriminant; 2. features are not independent. Non-discriminant means that they do not separate the classes well (figure: a discriminant vs. a non-discriminant feature).

Dimensionality Reduction Highly dependent features, even if very discriminant, are not needed - one is enough! E.g. a data-mining company studying consumer credit card ratings: X = {salary, mortgage, car loan, # of kids, profession,...}. The first three features tend to be highly correlated: the more you make, the higher the mortgage and the more expensive the car you drive. From one of these variables I can predict the others very well. Including features 2 and 3 does not increase the discrimination, but increases the dimension and leads to poor density estimates.

Dimensionality Reduction Q: How do we detect the presence of these correlations? A: The data lives in a low-dimensional subspace (up to some amount of noise). E.g. in a scatter plot of salary vs. car loan the points cluster around a line, so a new feature y given by the projection onto a 1D subspace, $y = a^T x$, captures most of the data. In the example above we have a 3D hyperplane in 5D. If we can find this hyperplane we can project the data onto it and get rid of two dimensions without introducing significant error.
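
As an illustration of this projection idea (a sketch with my own toy data and variable names, not the lecture's code), the snippet below builds two highly correlated features, takes the leading eigenvector of their covariance as the direction a, and projects onto the 1D subspace y = a^T x:

```python
import numpy as np

# Sketch with toy data: two highly correlated features ("salary" and
# "car loan"), projected onto the 1D subspace y = a^T x.
rng = np.random.default_rng(0)
salary = rng.normal(50.0, 10.0, size=500)
car_loan = 0.4 * salary + rng.normal(0.0, 1.0, size=500)   # almost determined by salary
X = np.column_stack([salary, car_loan])                    # 500 x 2 data matrix

# Direction of the subspace: leading eigenvector of the sample covariance.
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(np.cov(Xc.T))
a = evecs[:, -1]                                           # unit vector for the largest eigenvalue

y = Xc @ a                                                 # 1D projection y = a^T x
recon = np.outer(y, a)                                     # back-projection into 2D
residual = np.mean(np.sum((Xc - recon) ** 2, axis=1))
print(f"variance kept: {y.var():.2f}, mean squared residual: {residual:.2f}")
```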

Principal Components Basic idea: if the data lives in a (lower-dimensional) subspace, it is going to look very flat when viewed from the full space, e.g. a 1D subspace in 2D, or a 2D subspace in 3D. This means that if we fit a Gaussian to the data, the iso-probability contours are going to be highly skewed ellipsoids. The directions that explain most of the variance in the fitted data give the Principal Components of the data.

Principal Components How do we find these ellipsoids? When we talked about metrics we said that the Mahalanobis distance, $d^2(x, \mu) = (x - \mu)^T \Sigma^{-1} (x - \mu)$, measures the natural units for the problem because it is adapted to the covariance of the data. What is special about it is that it uses $\Sigma^{-1}$. Hence, information about possible subspace structure must be in the covariance matrix $\Sigma$.
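
A minimal numeric sketch of the Mahalanobis distance above (the toy data set and the query point are my own assumptions):

```python
import numpy as np

# Mahalanobis distance d^2(x, mu) = (x - mu)^T Sigma^{-1} (x - mu),
# estimated from a toy 2D data set.
rng = np.random.default_rng(1)
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=[[4.0, 1.8], [1.8, 1.0]], size=1000)

mu = X.mean(axis=0)
Sigma = np.cov(X.T)
Sigma_inv = np.linalg.inv(Sigma)

x = np.array([2.0, 1.0])                       # query point
d2 = (x - mu) @ Sigma_inv @ (x - mu)           # squared Mahalanobis distance
print(f"d^2(x, mu) = {d2:.3f}")
```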

Principal Components & Eigenvectors It turns out that all relevant information is stored in the eigenvalue/eigenvector decomposition of the covariance matrix. So, let's start with a brief review of eigenvectors. Recall: an n x n (square) matrix can represent a linear operator that maps a vector from the space $R^n$ back into the same space (a mapping whose domain and codomain are the same space is an endomorphism, and an automorphism if it is also invertible). E.g. the equation y = Ax, i.e. $\begin{bmatrix} y_1 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix}$, represents a linear mapping that sends x in $R^n$ to y, also in $R^n$.

Eigenvectors and Eigenvalues What is amazing is that there exist special ("eigen") vectors which are simply scaled by the mapping: $y = Ax = \lambda x$. These are the eigenvectors of the n x n matrix A. They are the solutions $\varphi_i$ of the equation $A \varphi_i = \lambda_i \varphi_i$, where the scalars $\lambda_i$ are the n eigenvalues of A. For a general matrix A, there is NOT necessarily a full set of n (linearly independent) eigenvectors.
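
A quick NumPy check of the eigenvector equation, together with an example (a shear matrix, my own choice) of a matrix that lacks a full set of independent eigenvectors:

```python
import numpy as np

# The eigenvector equation A phi = lambda phi, plus an example (a shear matrix)
# of a matrix WITHOUT a full set of independent eigenvectors.
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])                          # shear: repeated eigenvalue 1, defective
lam, V = np.linalg.eig(A)                           # eigenvectors are the columns of V

print(np.allclose(A @ V[:, 0], lam[0] * V[:, 0]))   # True: the eigenvector equation holds
print(lam)                                          # [1. 1.]
print(abs(np.linalg.det(V)))                        # ~0: the two returned eigenvectors are parallel
```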

Eigenvector Decomposition However, if A is n x n, real and symmetric, it has n real eigenvalues and n orthogonal eigenvectors. Note that these can be written all at once, $\begin{bmatrix} A\varphi_1 & \cdots & A\varphi_n \end{bmatrix} = \begin{bmatrix} \lambda_1 \varphi_1 & \cdots & \lambda_n \varphi_n \end{bmatrix}$, or, using the tricks that we reviewed in the 1st week, $A \begin{bmatrix} \varphi_1 & \cdots & \varphi_n \end{bmatrix} = \begin{bmatrix} \varphi_1 & \cdots & \varphi_n \end{bmatrix} \begin{bmatrix} \lambda_1 & & 0 \\ & \ddots & \\ 0 & & \lambda_n \end{bmatrix}$, i.e. $A\Phi = \Phi\Lambda$ with $\Phi = \begin{bmatrix} \varphi_1 & \cdots & \varphi_n \end{bmatrix}$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$.

Symmetric Matrix Eigendecomposition The n real orthogonal eigenvectors of a real A = A^T can be taken to have unit norm, in which case $\Phi$ is orthogonal, so that $\Phi\Phi^T = \Phi^T\Phi = I$ and $\Phi^{-1} = \Phi^T$, and $A\Phi = \Phi\Lambda$ becomes $A = \Phi\Lambda\Phi^T$. This is called the eigenvector decomposition, or eigendecomposition, of the matrix A. Because A is real and symmetric, it is a special case of the SVD. This factorization of A allows an alternative geometric interpretation of the matrix operation $y = Ax = \Phi\Lambda\Phi^T x$.
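
A quick numeric check of these identities with NumPy (the matrix A is my own small example):

```python
import numpy as np

# Eigendecomposition of a real symmetric matrix: A = Phi Lambda Phi^T with Phi orthogonal.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])                       # real, symmetric example matrix

lam, Phi = np.linalg.eigh(A)                     # eigenvalues and orthonormal eigenvectors (columns)
Lam = np.diag(lam)

print(np.allclose(Phi @ Phi.T, np.eye(2)))       # Phi Phi^T = I
print(np.allclose(A @ Phi, Phi @ Lam))           # A Phi = Phi Lambda
print(np.allclose(A, Phi @ Lam @ Phi.T))         # A = Phi Lambda Phi^T
```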

Eigenvector Decomposition This can be seen as a sequence of three steps. 1) Apply the inverse orthogonal transformation $\Phi^T$: $x' = \Phi^T x$. This is a transformation to a rotated coordinate system (plus a possible reflection). 2) Apply the diagonal operator $\Lambda$. This is just component-wise scaling in the rotated coordinate system: $x'' = \Lambda x' = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)\, x' = (\lambda_1 x'_1, \ldots, \lambda_n x'_n)^T$. 3) Apply the orthogonal transformation $\Phi$. This is a rotation back to the initial coordinate system: $y = \Phi x''$.
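
A self-contained sketch (same example matrix as above, with an arbitrary test vector of my own) checking that the rotate, scale, rotate-back sequence reproduces y = Ax:

```python
import numpy as np

# The action y = A x of a symmetric matrix decomposed into rotate -> scale -> rotate back.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
lam, Phi = np.linalg.eigh(A)

x = np.array([1.0, -2.0])            # arbitrary test vector

x_rot = Phi.T @ x                    # 1) rotate into the eigenvector coordinate system
x_scaled = lam * x_rot               # 2) scale each coordinate by its eigenvalue
y = Phi @ x_scaled                   # 3) rotate back to the original coordinate system

print(np.allclose(y, A @ x))         # True: same result as applying A directly
```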

Orthogonal Matrices Remember that orthogonal matrices are best understood by considering how the matrix operates on the vectors of the canonical basis (equivalently, on the unit hypersphere). Note that $\Phi$ sends $e_1$ to $\varphi_1$: $\Phi e_1 = \begin{bmatrix} \varphi_1 & \cdots & \varphi_n \end{bmatrix} (1, 0, \ldots, 0)^T = \varphi_1$ (in 2D, a rotation by $\theta$ sends $e_1$ to $(\cos\theta, \sin\theta)^T$). Since $\Phi^T$ is the inverse rotation (ignoring reflections), it sends $\varphi_1$ to $e_1$. Hence, the sequence of operations is: 1) rotate (ignoring reflections) $\varphi_i$ to $e_i$ (the canonical basis); 2) scale $e_i$ by the eigenvalue $\lambda_i$; 3) rotate the scaled $e_i$ back to the initial direction along $\varphi_i$.

Eigenvector Decomposition Graphically, these three steps are: (1) $\Phi^T$ rotates the ellipse axes onto the canonical axes $e_1, e_2$; (2) $\Lambda$ stretches the canonical axes to lengths $\lambda_1, \lambda_2$; (3) $\Phi$ rotates them back by $\theta$. This means that: A) the $\varphi_i$ are the axes of the ellipse; B) the width of the ellipse depends on the amount of stretching by $\lambda_i$.

Eigendecomposition Note that the stretching is done in Step (2): $x'' = \Lambda x' = (\lambda_1 x'_1, \ldots, \lambda_n x'_n)^T$, so for $x' = e_i$ the length of $x''$ is $\lambda_i$. Hence, the overall picture is: the axes of the ellipse are given by the $\varphi_i$, and these have length $\lambda_i$. This decomposition can be used to find "optimal" lower-dimensional subspaces.

Multivariate Gaussian Review The equiprobability contours (level sets) of a Gaussian are the points such that $(x - \mu)^T \Sigma^{-1} (x - \mu) = c$ for a constant c. Let's consider the change of variable $z = x - \mu$, which only moves the origin by $\mu$. The equation becomes $z^T \Sigma^{-1} z = c$, which is the equation of an ellipse (a hyperellipse). This is easy to see when $\Sigma$ is diagonal, $\Sigma = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$: it reduces to $\sum_i z_i^2 / \sigma_i^2 = c$.

Gaussian Review This is the equation of an ellipse with principal lengths $\sigma_i$. E.g. when d = 2, $\frac{z_1^2}{\sigma_1^2} + \frac{z_2^2}{\sigma_2^2} = 1$ is the ellipse with axes of length $\sigma_1$ along $z_1$ and $\sigma_2$ along $z_2$.

Gaussian Review Introduce a transformation $y = \Phi z$. Then y has covariance $\Sigma_y = \Phi \Sigma_z \Phi^T$. If $\Phi$ is proper orthogonal this is just a rotation, and we obtain a rotated ellipse with principal components $\varphi_1$ and $\varphi_2$, which are the columns of $\Phi$. Note that $\Sigma_y = \Phi \Sigma_z \Phi^T$, with $\Sigma_z = \mathrm{diag}(\sigma_1^2, \sigma_2^2)$, is the eigendecomposition of $\Sigma_y$.
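
A small numeric sketch of this rotation picture (the rotation angle and variances are my own example values):

```python
import numpy as np

# Start with z ~ N(0, diag(sigma1^2, sigma2^2)), rotate with a proper orthogonal Phi,
# and check that the covariance of y = Phi z is Phi Sigma_z Phi^T (its eigendecomposition).
sigma1, sigma2 = 3.0, 1.0
theta = np.pi / 6
Phi = np.array([[np.cos(theta), -np.sin(theta)],
                [np.sin(theta),  np.cos(theta)]])   # rotation by 30 degrees (det = +1)
Sigma_z = np.diag([sigma1**2, sigma2**2])

rng = np.random.default_rng(2)
z = rng.multivariate_normal([0.0, 0.0], Sigma_z, size=200_000)
y = z @ Phi.T                                       # y = Phi z, applied row-wise

Sigma_y = np.cov(y.T)
print(np.allclose(Sigma_y, Phi @ Sigma_z @ Phi.T, atol=0.1))   # True, up to sampling noise
```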

Principal Component Analysis (PCA) If y is Gaussian with covariance $\Sigma$, the equiprobability contours are ellipses whose principal components $\varphi_i$ are the eigenvectors of $\Sigma$, and whose principal values (lengths) $\sigma_i$ are the square roots of the eigenvalues $\lambda_i$ of $\Sigma$. By computing the eigenvalues we know whether the data is flat: $\sigma_1 \gg \sigma_2$ means flat, $\sigma_1 = \sigma_2$ means not flat.
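
Putting the pieces together, here is a minimal PCA sketch (my own toy data and function, not the lecture's code): estimate the covariance, eigendecompose it, sort by eigenvalue, read the principal lengths $\sigma_i = \sqrt{\lambda_i}$ to judge flatness, and project onto the leading components.

```python
import numpy as np

def pca(X, k):
    """Top-k principal components of the rows of X, their eigenvalues, and the projections."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X - mu, rowvar=False)             # sample covariance
    lam, Phi = np.linalg.eigh(Sigma)                 # ascending eigenvalues
    order = np.argsort(lam)[::-1]                    # re-sort in descending order
    lam, Phi = lam[order], Phi[:, order]
    return Phi[:, :k], lam, (X - mu) @ Phi[:, :k]

# Toy "flat" data: a 2D subspace embedded in 5D, plus a little noise.
rng = np.random.default_rng(3)
latent = rng.normal(size=(1000, 2))
B = rng.normal(size=(2, 5))
X = latent @ B + 0.05 * rng.normal(size=(1000, 5))

components, lam, Y = pca(X, k=2)
sigma = np.sqrt(np.maximum(lam, 0.0))                # principal lengths sigma_i = sqrt(lambda_i)
print("principal lengths:", np.round(sigma, 3))      # two large values, three near zero => flat
```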

Learning-based PCA

Learning-based PCA

END