Introduction to Machine Learning

10-701 Introduction to Machine Learning: PCA. Slides based on 18-661, Fall 2018.

PCA

Raw data can be complex and high-dimensional. To understand a phenomenon we measure various related quantities. If we knew what to measure, or how to represent our measurements, we might find simple relationships. But in practice we often measure redundant signals (e.g., US and European shoe sizes), and we represent data via the method by which it was gathered (e.g., the pixel representation of brain imaging data).

Dimensionality Reduction. Issues: we measure redundant signals, and we represent data via the method by which it was gathered. Goal: find a better representation for the data, both to visualize it and discover hidden patterns, and as preprocessing for a supervised task. How do we define "better"?

E.g., Shoe Size. We take noisy measurements on the European and American scales. Modulo noise, we expect perfect correlation. How can we do better, i.e., find a simpler, more compact representation? Pick a direction and project onto this direction. (Figure: scatter plot of European Size vs. American Size.)

Goal: Minimize Reconstruction Error. Minimize the Euclidean distances between the original points and their projections. The PCA solution solves this problem! (Figure: European Size vs. American Size.)
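
As a minimal sketch of this idea (not from the slides), the numpy snippet below projects synthetic, hypothetical shoe-size data onto one unit direction and measures the total squared reconstruction error; PCA picks the direction that makes this error smallest.

```python
import numpy as np

# Hypothetical noisy "shoe size" data: column 0 = European size, column 1 = American size.
rng = np.random.default_rng(0)
eu = rng.uniform(36, 46, size=100)
us = 0.78 * eu - 22 + rng.normal(0, 0.3, size=100)
X = np.column_stack([eu, us])
X = X - X.mean(axis=0)                    # center the data

p = np.array([0.8, 0.6])                  # a candidate unit direction (||p||_2 = 1)
z = X @ p                                 # 1D coordinates along the direction
X_hat = np.outer(z, p)                    # reconstructions back in 2D
recon_error = np.sum((X - X_hat) ** 2)    # total squared Euclidean distance
print(recon_error)
```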

Linear Regression. Predict y from x; evaluate the accuracy of the predictions (the blue line) by the vertical distances between the points and the line. PCA: reconstruct 2D data via 2D data with a single degree of freedom; evaluate the reconstructions (the blue line) by Euclidean distances. (Figure axes: x = European Size, y = American Size.)

Another Goal: Maximize Variance. To identify patterns we want to study the variation across observations. Can we do better, i.e., find a compact representation that captures this variation? The PCA solution finds the directions of maximal variance! (Figure: European Size vs. American Size.)

PCA Formulation. PCA: find a lower-dimensional representation of the raw data. X is n×d (raw data); Z = XP is n×k (reduced representation, the "PCA scores"); P is d×k (its columns are the k principal components). Variance constraints apply; the linearity assumption (Z = XP) simplifies the problem.

Given n training points with d features: $X \in \mathbb{R}^{n \times d}$ is the matrix storing the points, $x^{(i)}_j$ is the jth feature of the ith point, and $\mu_j$ is the mean of the jth feature. Variance of the 1st feature: $\sigma_1^2 = \frac{1}{n} \sum_{i=1}^{n} \big(x^{(i)}_1 - \mu_1\big)^2$. Variance of the 1st feature (assuming zero mean): $\sigma_1^2 = \frac{1}{n} \sum_{i=1}^{n} \big(x^{(i)}_1\big)^2$.

Given n training points with d features ($X \in \mathbb{R}^{n \times d}$, $x^{(i)}_j$, $\mu_j$ as above), the covariance of the 1st and 2nd features (assuming zero mean) is $\sigma_{12} = \frac{1}{n} \sum_{i=1}^{n} x^{(i)}_1 x^{(i)}_2$. It is symmetric: $\sigma_{12} = \sigma_{21}$. Zero covariance means the features are uncorrelated; a large magnitude means they are (anti-)correlated / redundant; $\sigma_{12} = \sigma_1^2 = \sigma_2^2$ means the features are identical.

Covariance Matrix. The covariance matrix generalizes this idea to many features. For zero-mean features, the d×d covariance matrix is $C_X = \frac{1}{n} X^\top X$. Its ith diagonal entry equals the variance of the ith feature, its ijth entry is the covariance between the ith and jth features, and it is symmetric (which makes sense given the definition of covariance).
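
A minimal numpy sketch of this definition, on hypothetical zero-mean data (the variable names and the random data are assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # n = 500 points, d = 3 features
X = X - X.mean(axis=0)                   # zero-mean features

C_X = (X.T @ X) / X.shape[0]             # d x d covariance matrix, C_X = (1/n) X^T X

# Diagonal entries are the per-feature variances; C_X is symmetric.
print(np.allclose(np.diag(C_X), X.var(axis=0)))   # True
print(np.allclose(C_X, C_X.T))                    # True
```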

PCA Formulation. PCA: find a lower-dimensional representation of the raw data. X is n×d (raw data); Z = XP is n×k (reduced representation, the "PCA scores"); P is d×k (its columns are the k principal components). Variance / covariance constraints: what constraints make sense for the reduced representation? No feature correlation, i.e., all off-diagonal entries of $C_Z$ are zero; and features rank-ordered by variance, i.e., the diagonal entries of $C_Z$ are sorted in decreasing order.

PCA Formulation. PCA: find a lower-dimensional representation of the raw data. X is n×d (raw data); Z = XP is n×k (reduced representation, the "PCA scores"); P is d×k (its columns are the k principal components). Under these variance / covariance constraints, P equals the top k eigenvectors of $C_X$.
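
To illustrate these constraints, here is a hedged sketch on hypothetical data: it takes P to be the top k eigenvectors of $C_X$ and checks that $C_Z$ is diagonal with a decreasing diagonal.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))   # correlated features
X = X - X.mean(axis=0)
C_X = (X.T @ X) / X.shape[0]

evals, evecs = np.linalg.eigh(C_X)          # eigh returns ascending eigenvalues
order = np.argsort(evals)[::-1]             # re-sort in decreasing order
k = 2
P = evecs[:, order[:k]]                     # d x k: top-k eigenvectors

Z = X @ P                                   # n x k PCA scores
C_Z = (Z.T @ Z) / Z.shape[0]

print(np.round(C_Z, 6))                     # diagonal, with decreasing diagonal entries
```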

PCA Solution. Every covariance matrix has an eigendecomposition $C_X = U \Lambda U^\top$, where U is d×d (its columns are eigenvectors, sorted by their eigenvalues) and $\Lambda$ is d×d diagonal (its diagonal entries are the eigenvalues; off-diagonals are zero). The d eigenvectors are orthonormal directions of maximal variance, and the associated eigenvalues equal the variance in these directions; the 1st eigenvector is the direction of maximal variance (with variance $\lambda_1$).
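
A small numeric check, on hypothetical data, of the claim that the eigenvalues equal the variances along the eigenvector directions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 3)) @ np.array([[2.0, 0.5, 0.0],
                                           [0.0, 1.0, 0.3],
                                           [0.0, 0.0, 0.5]])
X = X - X.mean(axis=0)
C_X = (X.T @ X) / X.shape[0]

evals, U = np.linalg.eigh(C_X)
evals, U = evals[::-1], U[:, ::-1]          # sort by decreasing eigenvalue

for i in range(3):
    var_along_u = np.var(X @ U[:, i])       # variance of the data projected onto u_i
    print(np.isclose(var_along_u, evals[i]))  # True: variance equals the eigenvalue
```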

Choosing k. How should we pick the dimension of the new representation? For visualization: pick the top 2 or 3 dimensions for plotting purposes. For other analyses: capture most of the variance in the data. Recall that the eigenvalues are the variances in the directions specified by the eigenvectors, and that the eigenvalues are sorted. Fraction of retained variance: $\sum_{i=1}^{k} \lambda_i \,\big/\, \sum_{i=1}^{d} \lambda_i$. We can choose k such that we retain some fraction of the variance, e.g., 95%.
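
A possible helper for this rule, sketched in numpy; the function name choose_k and the example eigenvalue spectrum are hypothetical:

```python
import numpy as np

def choose_k(eigenvalues, target=0.95):
    """Smallest k whose top-k eigenvalues retain `target` fraction of the total variance."""
    lam = np.sort(eigenvalues)[::-1]
    retained = np.cumsum(lam) / np.sum(lam)
    return int(np.searchsorted(retained, target) + 1)

# Hypothetical eigenvalue spectrum:
lam = np.array([5.0, 2.0, 1.0, 0.5, 0.3, 0.2])
print(choose_k(lam, target=0.95))   # -> 5 (top 5 retain ~97.8% >= 95%; top 4 retain only ~94.4%)
```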

Other Practical Tips. The PCA assumptions (linearity, orthogonality) are not always appropriate; there are various extensions of PCA with different underlying assumptions, e.g., manifold learning, kernel PCA, and ICA. Centering is crucial, i.e., we must preprocess the data so that all features have zero mean before applying PCA. PCA results also depend on the scaling of the data, so data is sometimes rescaled in practice before applying PCA.
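
A minimal preprocessing sketch, assuming numpy and hypothetical data, showing centering (required) and optional rescaling to unit variance:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])  # wildly different scales

X_centered = X - X.mean(axis=0)                        # centering: required before PCA
X_standardized = X_centered / X_centered.std(axis=0)   # optional rescaling to unit variance

# PCA on X_centered weights features by their raw variance;
# PCA on X_standardized treats all features equally (equivalent to using the correlation matrix).
```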

Orthogonal and Orthonormal Vectors. Orthogonal vectors are perpendicular to each other; equivalently, their dot product equals zero. With $a = [1, 0]^\top$, $b = [0, 1]^\top$, $c = [1, 1]^\top$, $d = [2, 0]^\top$: $a^\top b = 0$ and $d^\top b = 0$, but c is not orthogonal to the others. Orthonormal vectors are orthogonal and have unit norm: a and b are orthonormal, but b and d are not orthonormal.
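
The same checks in numpy, using the slide's vectors a, b, c, d:

```python
import numpy as np

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0])
d = np.array([2.0, 0.0])

print(a @ b, d @ b)                 # 0.0 0.0  -> orthogonal pairs
print(c @ a, c @ b, c @ d)          # 1.0 1.0 2.0 -> c is not orthogonal to the others
print(np.linalg.norm(a), np.linalg.norm(b), np.linalg.norm(d))  # 1.0 1.0 2.0
# a and b are orthonormal (orthogonal with unit norm); b and d are orthogonal but not orthonormal.
```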

PCA Iterative Algorithm. For k = 1: find the direction of maximal variance and project onto this direction; the locations along this direction are the new 1D representation. More generally, for i in {1, ..., k}: find the direction of maximal variance that is orthonormal to the previously selected directions and project onto this direction; the locations along this direction are the ith feature of the new representation. (Figure: European Size vs. American Size.)
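
A hedged sketch of this greedy procedure; it uses power iteration with deflation to find each direction, which is one standard choice but not something the slides prescribe, and all names and data are hypothetical:

```python
import numpy as np

def top_direction(C, iters=1000):
    """Power iteration: approximate the top eigenvector of a symmetric PSD matrix C."""
    p = np.random.default_rng(0).normal(size=C.shape[0])
    for _ in range(iters):
        p = C @ p
        p /= np.linalg.norm(p)
    return p

def greedy_pca(X, k):
    """Find k orthonormal directions of maximal variance, one at a time, and project onto them."""
    X = X - X.mean(axis=0)
    C = (X.T @ X) / X.shape[0]
    directions = []
    for _ in range(k):
        p = top_direction(C)
        directions.append(p)
        C = C - (p @ C @ p) * np.outer(p, p)   # deflate: remove the variance along p
    P = np.column_stack(directions)            # d x k
    return X @ P                               # n x k scores

# Usage on hypothetical data:
X = np.random.default_rng(1).normal(size=(200, 5))
Z = greedy_pca(X, k=2)                         # 200 x 2 reduced representation
```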

Eigendecomposition. Every covariance matrix has an eigendecomposition $C_X = U \Lambda U^\top$, where U is d×d (its columns are eigenvectors, sorted by their eigenvalues) and $\Lambda$ is d×d diagonal (its diagonal entries are the eigenvalues; off-diagonals are zero). Eigenvector / eigenvalue equation: $C_X u = \lambda u$, where by definition $u^\top u = 1$ (unit norm). Example: $\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 0 \end{bmatrix}$, i.e., eigenvector $u = [1, 0]^\top$ with eigenvalue $\lambda = 1$.

PCA Formulation, k = 1. PCA: find a one-dimensional representation of the raw data. X is n×d (raw data); z = Xp is n×1 (reduced representation, the "PCA scores"); p is d×1 (the single principal component). Variance (assuming zero-mean features): $\sigma_z^2 = \frac{1}{n} \sum_{i=1}^{n} \big(z^{(i)}\big)^2 = \frac{1}{n} \|z\|_2^2$. Goal: maximize the variance, i.e., $\max_p \sigma_z^2$ subject to $\|p\|_2 = 1$.

Goal: maximize the variance, i.e., $\max_p \sigma_z^2$ subject to $\|p\|_2 = 1$.
$\sigma_z^2 = \frac{1}{n} \|z\|_2^2$
$\quad = \frac{1}{n} z^\top z$ (relationship between the Euclidean norm and the dot product)
$\quad = \frac{1}{n} (Xp)^\top (Xp)$ (definition: z = Xp)
$\quad = \frac{1}{n} p^\top X^\top X p$ (transpose property $(Xp)^\top = p^\top X^\top$; associativity of multiplication)
$\quad = p^\top C_X p$ (definition: $C_X = \frac{1}{n} X^\top X$)
Restated goal: $\max_p p^\top C_X p$ subject to $\|p\|_2 = 1$.
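
A quick numeric check of the identity derived above, $\sigma_z^2 = p^\top C_X p$, on hypothetical centered data:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(300, 3))
X = X - X.mean(axis=0)
C_X = (X.T @ X) / X.shape[0]

p = rng.normal(size=3)
p /= np.linalg.norm(p)          # unit-norm direction

z = X @ p
print(np.isclose(np.var(z), p @ C_X @ p))   # True: variance of scores equals p^T C_X p
```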

Connection to Eigenvectors. Restated goal: $\max_p p^\top C_X p$ subject to $\|p\|_2 = 1$. Recall the eigenvector / eigenvalue equation $C_X u = \lambda u$. By definition $u^\top u = 1$, and thus $u^\top C_X u = \lambda$. But this is exactly the expression we are optimizing, and thus the maximal variance is achieved when p is the top eigenvector of $C_X$. Similar arguments can be used for k > 1.
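
A small numeric illustration on hypothetical data: the top eigenvector attains variance $\lambda_{\max}$, and no random unit vector does better.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))
X = X - X.mean(axis=0)
C_X = (X.T @ X) / X.shape[0]

evals, U = np.linalg.eigh(C_X)
u_top = U[:, -1]                            # eigenvector with the largest eigenvalue

best_random = 0.0
for v in rng.normal(size=(10000, 4)):
    p = v / np.linalg.norm(p) if False else v / np.linalg.norm(v)  # normalize to unit length
    best_random = max(best_random, p @ C_X @ p)

print(np.isclose(u_top @ C_X @ u_top, evals[-1]))   # True: u^T C_X u = lambda_max
print(u_top @ C_X @ u_top >= best_random)           # True: no random unit vector beats it
```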