Data Preprocessing Tasks

Data Tasks: Data Reduction (task 4). We're here.

Dimensionality Reduction. Dimensionality reduction is a commonly used approach for generating fewer features. It is typically used because too many features can degrade performance. Key idea: create a new, smaller set of features that is representative of the original features.

OptDigits Dataset (Optical Recognition of Handwritten Digits Dataset): 5,620 instances, sampled from 43 people; 64 features or dimensions (an 8 × 8 input matrix whose elements are integers 0-16); 1 class or target variable (a handwritten digit, 0-9).
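As a concrete reference point, here is a minimal sketch of loading a closely related digits dataset. It assumes scikit-learn is available; its load_digits() helper ships a 1,797-instance subset of the UCI OptDigits data rather than all 5,620 instances, but it has the same 8 × 8 layout with integer pixel values 0-16.

```python
# Minimal sketch: load a digits dataset similar to OptDigits.
# Assumes scikit-learn is installed; load_digits() is a subset of OptDigits.
from sklearn.datasets import load_digits

digits = load_digits()
X, y = digits.data, digits.target   # X: (n_samples, 64), y: digit labels 0-9
print(X.shape)                      # (1797, 64)
print(X.min(), X.max())             # pixel intensities in the 0-16 range
print(digits.images[0].shape)       # (8, 8) -- the un-flattened bitmap
```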

OptDigits Dataset: the raw data is preprocessed by normalization (conversion to a bitmap), blurring, and downsampling to produce the preprocessed data.

OptDigits Dataset: Random Projection. Notice that the data is not well separated. In general, we would prefer the data to be better separated, particularly by class (digit).
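The lecture does not say which random projection produced the figure above; the following is a hedged sketch that uses scikit-learn's GaussianRandomProjection as a stand-in to reproduce the general effect (overlapping classes in 2D).

```python
# Sketch: a 2-D random projection of the digits data (classes tend to overlap).
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.random_projection import GaussianRandomProjection

X, y = load_digits(return_X_y=True)
X2 = GaussianRandomProjection(n_components=2, random_state=0).fit_transform(X)

plt.scatter(X2[:, 0], X2[:, 1], c=y, cmap="tab10", s=8)
plt.title("Digits under a random 2-D projection")
plt.show()
```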

Key to Principal Component Analysis: a projection should maximize the variation in the data, as this captures the uniqueness of the data. Each projection should be made in the direction that captures the most variance (that is, the direction in which the data is most spread out), as this variability reflects the internal structure of the data.

Principal Component Analysis (PCA) PCA addresses two objectives: 1. Generally, it is desirable to reduce the number of features used to represent the data. 2. Generally, it is desirable for the set of features to describe a large amount of information. PCA attempts to create features or dimensions sequentially based upon the amount of information they capture.

PCA: A Visual Description. To maximize the amount of information or structure retained, we can try to maximize the variance retained by a projection. (Figure: 3-D data on axes x1, x2, x3.)

PCA: A Visual Description. We can observe this variance based upon the relative sparsity or density of the data: sparser data has greater variance, denser data has lesser variance.

PCA: A Visual Description. Our initial projection, v1, should maximize the variance of the data. (Figure: the original data with the first projection direction v1.)

PCA: A Visual Description. The next projection should maximize the remaining variance while being orthogonal (perpendicular) to the previous projection(s). If the projection is not orthogonal, it will capture the same variability already captured by previous projection(s).

PCA: A Visual Description. (Figure: the original 3-D data and the transformed 2-D data.)

PCA: A Visual Description. (Figure: the transformed data plotted on the new features v1 and v2, rotated for display purposes.)

PCA: A Mathematical Description. Let's begin with how to represent the data.

Data and Vectors. Feature data is often represented as a vector, a quantity that has magnitude and direction. For example, the vectors a1 and a2, shown below, each have a length of 3. We've illustrated a1 with a direction parallel to the y-axis and a2 with a direction of 45° with respect to the x-axis. a1 = (2, 6, 1), a2 = (7, 5, 2).

Data and Matrices. Data from several features are often represented as a matrix, a tabular representation of a set of numbers as a collection of rows and columns. For example, the matrix A, shown below, is a 2 by 3 (2 × 3) matrix. The transpose of A is written as A^T, and is produced by interchanging the rows and columns of A.
A = [ 2  6  1
      7  5  2 ]
A^T = [ 2  7
        6  5
        1  2 ]
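The same vectors and matrices can be written down directly with NumPy; this small sketch uses the example values from the slide to show the transpose operation.

```python
# Sketch: the slide's vectors and matrix in NumPy.
import numpy as np

a1 = np.array([2, 6, 1])       # a vector with 3 elements
a2 = np.array([7, 5, 2])

A = np.array([[2, 6, 1],
              [7, 5, 2]])      # a 2 x 3 matrix
print(A.shape)                 # (2, 3)
print(A.T)                     # the transpose A^T, a 3 x 2 matrix
```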

Vectors and Matrices. Notes: notationally, it is traditional to represent matrices as bold upper case (i.e., A) and vectors as bold lower case (i.e., a), with all vectors treated as row vectors unless explicitly transposed (i.e., a^T) to represent column vectors. Also, by convention, a matrix in which all non-diagonal entries are 0 is termed a diagonal matrix.

Eigenvalues and Eigenvectors. The eigenvalues and eigenvectors of an n by n (square) matrix A are, respectively, the scalar values λ and the nonzero vectors x that are solutions to the following equation: Ax = λx. In other words, eigenvectors are the vectors that are unchanged, except in magnitude, when multiplied by A; the number λ is the corresponding eigenvalue of A.

Eigenvalues and Eigenvectors. Consider the matrix A and the vectors x1 and x2:
A = [ 0.8  0.3
      0.2  0.7 ]
x1 = (0.6, 0.4)^T, x2 = (1, -1)^T
Notice that:
A x1 = (0.6, 0.4)^T = x1 (Ax = x means that λ1 = 1)
A x2 = (0.5, -0.5)^T = (1/2) x2 (so λ2 = 1/2)
Note that if x1 is multiplied again by A, we still get x1.

Eigenvalues and Eigenvectors. A^2 has eigenvalues 1^2 and 0.5^2:
λ1 = 1: A x1 = x1 = (0.6, 0.4)^T; λ1^2 = 1: A^2 x1 = x1 = (0.6, 0.4)^T
λ2 = 0.5: A x2 = λ2 x2 = (0.5, -0.5)^T with x2 = (1, -1)^T; λ2^2 = 0.25: A^2 x2 = 0.5^2 x2 = (0.25, -0.25)^T
The eigenvectors keep their directions, whereas all other vectors change direction. But note that all other vectors are combinations of the (two) eigenvectors.
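As a quick numerical check of the example above, the following sketch (assuming NumPy) multiplies the eigenvectors by A and A^2 and also asks np.linalg.eig for the eigenvalues; the solver is used here only for verification.

```python
# Sketch: verify the worked eigenvalue/eigenvector example numerically.
import numpy as np

A = np.array([[0.8, 0.3],
              [0.2, 0.7]])
x1 = np.array([0.6, 0.4])      # eigenvector with eigenvalue 1
x2 = np.array([1.0, -1.0])     # eigenvector with eigenvalue 0.5

print(A @ x1)                  # [0.6  0.4]   -> A x1 = 1.0 * x1
print(A @ x2)                  # [0.5 -0.5]   -> A x2 = 0.5 * x2
print(A @ A @ x2)              # [0.25 -0.25] -> A^2 x2 = 0.25 * x2

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)                 # eigenvalues 1.0 and 0.5 (order may vary)
```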

Variance. Recall that the variance is the most common measure of the spread of a set of points:
Var(x) = s_x^2 = (1 / (m - 1)) * Σ_{i=1}^{m} (x_i - x̄)^2

Covariance. Recall that the covariance is a measure of the degree to which two variables vary together, and is given by:
Cov(x_i, x_j) = (1 / (m - 1)) * Σ_{k=1}^{m} (x_ki - x̄_i)(x_kj - x̄_j)
where x_ki and x_kj are the values of the i-th and j-th feature vectors for the k-th object.
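A minimal sketch, assuming NumPy and small made-up example vectors, showing that the formulas above match NumPy's built-in variance and covariance with the 1/(m - 1) normalization (ddof=1).

```python
# Sketch: variance and covariance by the formulas vs. NumPy's built-ins.
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 5.0])
m = len(x)

var_x = np.sum((x - x.mean()) ** 2) / (m - 1)
cov_xy = np.sum((x - x.mean()) * (y - y.mean())) / (m - 1)

print(var_x, np.var(x, ddof=1))     # both ~6.667
print(cov_xy, np.cov(x, y)[0, 1])   # both ~3.667
```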

Covariance Matrices. The covariance of two feature vectors X and Y is given by a covariance matrix
C_XY = [ Cov(X, X)  Cov(X, Y)
         Cov(Y, X)  Cov(Y, Y) ]
where Cov(X, X) = Var(X), Cov(Y, Y) = Var(Y), and Cov(X, Y) = Cov(Y, X).

Covariance Matrices. The covariance matrix of n feature vectors is given as
C_X = [ Cov(X_1, X_1)  Cov(X_1, X_2)  ...  Cov(X_1, X_n)
        Cov(X_2, X_1)  Cov(X_2, X_2)  ...  Cov(X_2, X_n)
        ...
        Cov(X_n, X_1)  Cov(X_n, X_2)  ...  Cov(X_n, X_n) ]
where Cov(X_i, X_j) = Cov(X_j, X_i) and Cov(X_i, X_i) ≥ 0.

Covariance Matrix Decomposition. The covariance matrix C_X can be decomposed as follows:
C_X = Φ Λ Φ^(-1), with Φ^(-1) = Φ^T.
Thus, C_X can be decomposed as the product of three matrices. Φ is known as the eigenvector matrix and Λ as the eigenvalue matrix. Specifically, the columns of Φ are the eigenvectors of C_X, while the diagonal elements of Λ are the eigenvalues of C_X.
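The decomposition can be checked numerically. This sketch (assuming NumPy and randomly generated data) uses np.linalg.eigh, which is appropriate because a covariance matrix is symmetric, and verifies that Φ Λ Φ^T reproduces C_X and that Φ^(-1) = Φ^T.

```python
# Sketch: eigendecomposition of a covariance matrix, C = Phi Lambda Phi^T.
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(100, 3))           # 100 observations, 3 features
C = np.cov(D, rowvar=False)             # 3 x 3 covariance matrix

eigvals, Phi = np.linalg.eigh(C)        # columns of Phi are eigenvectors
Lam = np.diag(eigvals)

print(np.allclose(Phi @ Lam @ Phi.T, C))     # True: C = Phi Lambda Phi^T
print(np.allclose(Phi.T @ Phi, np.eye(3)))   # True: Phi^(-1) = Phi^T
```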

From Linear Algebra to PCA So what does this all have to do with PCA?

The Goal of PCA. A goal of PCA is to find a transformation of the data that satisfies the following properties: (1) each pair of new features has 0 covariance; (2) the features are ordered with respect to how much of the variance of the data each feature captures; (3) the first feature captures as much of the variance of the data as possible; (4) each successive feature captures as much of the remaining variance as possible, subject to being orthogonal to the previous components.

Covariance Matrix for PCA. Given a dataset represented as an m by n matrix D: first, center the data (subtract the mean of each feature); then calculate the covariance matrix C, where each entry of C is c_ij = Cov(d_i, d_j), with i and j indexing features. With each feature of D centered to have a mean of 0, C is proportional to D^T D (the constant 1/(m - 1) does not affect the principal directions). In words, if the feature means of D are 0, then the covariance matrix C is given by the product of the transpose of D with D, i.e., D^T D; see the sketch below.
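A minimal sketch of the centering step, assuming NumPy and synthetic data, confirming that D^T D on the centered data (scaled by 1/(m - 1)) matches np.cov.

```python
# Sketch: after centering, D^T D / (m-1) reproduces the covariance matrix.
import numpy as np

rng = np.random.default_rng(1)
D = rng.normal(size=(50, 4))            # m = 50 objects, n = 4 features
Dc = D - D.mean(axis=0)                 # center each feature (mean 0)

C = Dc.T @ Dc / (len(D) - 1)            # covariance matrix from D^T D
print(np.allclose(C, np.cov(D, rowvar=False)))   # True
```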

PCA: A Mathematical Description. As a covariance matrix, C can be decomposed into an eigenvector matrix and an eigenvalue matrix. Thus, with C = D^T D: let λ1, ..., λn be the eigenvalues of C, ordered such that λ1 ≥ λ2 ≥ ... ≥ λn, and let Φ = [φ1, ..., φn] be the matrix of eigenvectors of C, ordered so that the i-th eigenvector corresponds to the i-th largest eigenvalue.

PCA: A Mathematical Description. Then PCA computes: D' = DΦ. The data matrix D' is the set of transformed data that satisfies the posed goals of PCA. In words, D' is the matrix that results from taking the product of the original (centered) data matrix D with Φ, the eigenvector matrix of C. Recall that C is the covariance matrix of D (i.e., D^T D).
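Putting the steps together, here is a from-scratch PCA sketch (assuming NumPy, with scikit-learn used only for comparison; the lecture itself does not name an implementation). The new features come out uncorrelated and ordered by variance, and they match scikit-learn's PCA up to the arbitrary sign of each component.

```python
# Sketch: PCA from scratch (center -> covariance -> eigendecompose -> project).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
D = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated features

Dc = D - D.mean(axis=0)                     # center the data
C = Dc.T @ Dc / (len(D) - 1)                # covariance matrix
eigvals, Phi = np.linalg.eigh(C)            # eigendecomposition (ascending)
order = np.argsort(eigvals)[::-1]           # sort by descending eigenvalue
eigvals, Phi = eigvals[order], Phi[:, order]

D_new = Dc @ Phi                            # transformed data D' = D Phi

# New features are uncorrelated and ordered by variance:
print(np.allclose(np.cov(D_new, rowvar=False), np.diag(eigvals)))   # True

# Compare with scikit-learn (equal up to the sign of each component):
D_skl = PCA(n_components=5).fit_transform(D)
print(np.allclose(np.abs(D_new), np.abs(D_skl)))                    # True
```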

PCA: A Mathematical Description The eigenvectors represent linear combinations of the original features. Each is a new axis or dimension. The eigenvalues are measures of the amount of variance captured by the eigenvectors. Each eigenvector has a corresponding eigenvalue.

OptDigits Dataset: PCA Projection. OptDigits Data Projection (2D). Notice that the data is well-separated. In general, each class (digit) is well-separated from the others.

OptDigits Dataset: A Comparison. (Figure: Random Projection (2D) vs. PCA Projection (2D).)

How Many Principal Components? If all of the principal components are kept, then there is no data reduction. Rather, the objective is to retain as few principal components as possible while still capturing most of the variance in the data. Scree plots can help determine the number of components to retain.

Scree Plots. A scree plot is a simple line-segment plot that shows the fraction of total variance in the data explained or represented by each principal component. The point where there is a significant drop in the explained variance is sometimes called the knee or elbow point of the plot.
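A hedged sketch of producing such a scree plot for the digits data, assuming matplotlib and scikit-learn; the explained_variance_ratio_ attribute of a fitted PCA gives the fraction of total variance per component.

```python
# Sketch: scree plot of explained variance for the digits data.
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)
pca = PCA().fit(X)                      # keep all components

plt.plot(range(1, len(pca.explained_variance_ratio_) + 1),
         pca.explained_variance_ratio_, marker="o")
plt.xlabel("Principal component")
plt.ylabel("Fraction of total variance explained")
plt.title("Scree plot (look for the knee/elbow)")
plt.show()
```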

OptDigits Dataset: PCA Scree Plot

OptDigits Dataset: PCA Scree Plot (knee point highlighted)

OptDigits Dataset: A Fun PCA Scree Plot (knee point highlighted)

Advantages of PCA. By generating new features that maximize the variance in the data, the data points can often remain well-separated. So what's the catch? Or is class over?

Limitations of PCA. PCA does not necessarily help with segmenting or separating data. PCA generates principal components, which are linear combinations of the original features, so non-linear structure may not be captured.

Limitations of PCA: Separating Data. PCA does not necessarily help with segmenting or separating data, and thus does not necessarily improve modeling. (Figure: well-separated data, its first principal component, and the resulting transformation, which is not well-separated.)

Limitations of PCA: Separating Data. In contrast, a better projection might look like the following, which maintains the separation of the data. (Figure: well-separated data, a potential projection v1, and the resulting transformation, which remains well-separated.)

Swiss Roll Dataset. A synthetic non-linear dataset: 1,500 instances; 3 features or dimensions (X is based on a cosine function, Y is a random real value in 0-21, Z is based on a sine function); looks delicious. (Figure: Swiss Roll Data (3D).)

Limitations of PCA: Non-linear Data. Swiss Roll Data Projection (2D). Notice that as a result of PCA, the projected data is not well-separated. The principal components maximize the variance, but not necessarily the separation of the data.

Locally Linear Embedding (LLE). Reduce dimensionality by analyzing overlapping local neighborhoods to determine local structure: 1. Find the nearest neighbors of each data point. 2. Express each point x_i as a linear combination of the other points, i.e., x_i = Σ_j w_ij x_j, where w_ij = 0 if x_j is not a near neighbor of x_i. 3. Find the coordinates of each point in a lower-dimensional space of specified dimension p by using the weights found in step 2. A sketch of this procedure on the Swiss roll data is shown below.
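A sketch of the procedure on a synthetic Swiss roll, assuming scikit-learn's make_swiss_roll and LocallyLinearEmbedding; the neighborhood size and other parameter choices here are illustrative assumptions, not values from the lecture.

```python
# Sketch: LLE on a synthetic Swiss roll, projected to 2-D.
import matplotlib.pyplot as plt
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, color = make_swiss_roll(n_samples=1500, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2, random_state=0)
X_lle = lle.fit_transform(X)            # 2-D coordinates preserving local structure

plt.scatter(X_lle[:, 0], X_lle[:, 1], c=color, cmap="viridis", s=8)
plt.title("Swiss roll unrolled by LLE (2D)")
plt.show()
```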

LLE Projection (Swiss Roll Dataset). Swiss Roll Data Projection (2D). Notice that as a result of LLE, the projected data is well-separated. LLE attempts to retain local structure.

Summarizing Data

Data preprocessing is informed by data understanding. Data cleaning involves the correction of data quality problems, including imputing missing data, smoothing noisy data, and removing outliers and inconsistencies. Data transformation involves the aggregation, generalization, and normalization of data, and the possible construction of new features from old ones. Data reduction involves the sampling of data, selection of features, and reduction of dimensionality.

And Now... Let's see some data reduction!