
Dimension Reduction and Classification Using PCA and Factor Analysis - A Short Overview. Laboratory for Interdisciplinary Statistical Analysis, Department of Statistics, Virginia Tech. http://www.stat.vt.edu/consult/ March 2, 2009

Outline PCA - Factor Analysis - Discussions - Difference between PCA and Factor Analysis

The Problem What do you do when you have too many predictors in a model? For example, you have expression-level data for 1000 genes. Or you have hundreds of customer attributes and want to build a predictive model based on them. Or you have second-by-second stock market data over a trading day for many stocks. Or you have survey data where multiple questions may capture the same kind of information (highly correlated).

The Cars A researcher wants to build a model to find out which variables are most significant in predicting the demand for cars, but believes that many of the variables are highly correlated and that the study can be done effectively on a small number of variables without losing much information.

The Problem We are given a data set with N observations on X = (x_1, ..., x_p) for a very large p. Figure: Data with 11 possible predictors

The Problem How do we reduce the number of columns in X without throwing away too much information?

The Problem In JMP: Analyze > Multivariate Methods > Principal Components; then, on the Multivariate tab, Scatterplot Matrix.
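For readers without JMP, the same first look at the data can be sketched in Python with pandas. The file name cars.csv is an assumption for illustration, not part of the original slides:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file holding the 387-car data set described below.
cars = pd.read_csv("cars.csv")

# Pairwise correlations, the analogue of JMP's Multivariate report.
print(cars.corr().round(3))

# Scatterplot matrix, the analogue of JMP's Scatterplot Matrix.
pd.plotting.scatter_matrix(cars, figsize=(10, 10), diagonal="hist")
plt.show()
```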

The Problem Notice the highly correlated variables! We will attempt to explain most of the variability in the data while using a small number of principal components (parsimony), if possible.

The Geometric Interpretation We want rotations and projections in p dimensions that capture most of the variability. Figure: Plot of the data in three dimensions

The Geometric Interpretation - Eigens We can write the principal components as Y_1 = a_1' X, ..., Y_p = a_p' X, such that the Y's are uncorrelated and the variance of each Y is as large as possible. We find the eigenvalues λ_1, ..., λ_p of the covariance matrix of the data and rank them by size. The a's are the corresponding eigenvectors, and the eigenvalues are the corresponding variances. Since the total population variance = λ_1 + ... + λ_p, the proportion of variance explained by the k-th principal component = λ_k / (λ_1 + ... + λ_p).
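The slides do this computation in JMP; as a minimal sketch of the same eigendecomposition in Python with NumPy (the function name pca_eig is ours, not from the slides):

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the sample covariance matrix.

    X is an (n, p) data matrix. Returns the eigenvalues (component
    variances, largest first), the eigenvectors a_k (as columns), the
    proportion of variance explained, and the component scores Y.
    """
    Xc = X - X.mean(axis=0)               # center each column
    S = np.cov(Xc, rowvar=False)          # p x p sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)  # eigh: for symmetric matrices
    order = np.argsort(eigvals)[::-1]     # rank eigenvalues by size
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()   # lambda_k / (lambda_1 + ... + lambda_p)
    scores = Xc @ eigvecs                 # Y_k = a_k' X for each observation
    return eigvals, eigvecs, explained, scores
```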

The Geometric Interpretation - Eigens Summary: Principal components are determined by our predictors. There is a principal component for every eigenvalue. The value of an eigenvalue measures how much variation the corresponding principal component explains.

The Geometric Interpretation - Eigens Summary: By choosing the first few principal components (and hence eigenvalues), we might be able to explain a lot of the variation among the predictors (not all!). Hence we throw away some information, but hopefully not much.

The Cars We have data on 387 cars with the following variables:
Suggested Retail Price
Invoice Price
Engine Size (liters)
Number of Cylinders (= -1 if rotary engine)
Horsepower
City Miles Per Gallon
Highway Miles Per Gallon
Weight (pounds)
Wheel Base (inches)
Length (inches)
Width (inches)

The Cars Again A researcher wants to build a model to find out which variables are most significant in predicting the demand for cars, but believes that many of the variables are highly correlated and that the study can be done effectively on a small number of variables without losing much information. But how do we choose a smaller number of predictors? Principal Components Analysis!

The Cars In JMP: Analyze > Multivariate Methods > Principal Components.

The Cars Let us first look at the correlations between the variables. Figure: Correlations

The Cars What about the principal components? Can we interpret them?

The Cars How many principal components do we need? How much of the variation is explained?
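The slides answer this from JMP's variance-explained report. As an illustrative sketch (the 90% threshold and the random stand-in data are our assumptions), one can look at the cumulative proportion of variance explained:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(387, 11))        # random stand-in for the 11 car variables

# The car variables are on very different scales (price vs. MPG), so we
# work with the correlation matrix, i.e. standardized variables.
R = np.corrcoef(X, rowvar=False)
eigvals = np.sort(np.linalg.eigvalsh(R))[::-1]
cumulative = np.cumsum(eigvals / eigvals.sum())

k = int(np.searchsorted(cumulative, 0.90)) + 1   # smallest k reaching 90%
print(f"keep {k} components, explaining {cumulative[k-1]:.1%} of the variance")
```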

Key Points Principal components are functions of the predictors. The first few principal components can give us almost all the information, in terms of the variability in the data.

PCA - Discussion Why PCA? To reduce the number of predictors. As a first step for a predictive model where we would like to remove correlated variables. For general dimension reduction, when we expect a low-dimensional structure and the higher dimensions are basically noise.

Factor Analysis - The Problem Sometimes the inherent structure of the data motivates the researcher to group the data based on some unseen underlying factors. This inherent structure can be identified through the correlation matrix of X.

The Subject Scores Problem Consider examination scores in 6 subjects for 220 male students. The 6 subjects are Latin, English, History, Arithmetic, Algebra, and Geometry. Consider the correlation matrix for the scores (lower triangle):

            Latin  English  History  Arithmetic  Algebra  Geometry
Latin       1.000
English      .439   1.000
History      .410    .351    1.000
Arithmetic   .288    .354     .164      1.000
Algebra      .329    .320     .190       .595    1.000
Geometry     .248    .329     .181       .470     .464    1.000

The Problem The researcher believes that the subject scores will be correlated amongst themselves in groups. A possible hypothesis is that there are two underlying factors for the students' scores: a factor that captures the liberal arts scores and another that captures the science scores. But how do we verify such a hypothesis?

Factor Loadings For our problem the researcher thinks that there are two underlying factors, corresponding to two different sets of loadings on the 6 subjects:
Latin = L_11 F_1 + L_12 F_2 + ε_1
English = L_21 F_1 + L_22 F_2 + ε_2
...
Geometry = L_61 F_1 + L_62 F_2 + ε_6
The loadings L_ij will hopefully help us interpret the factors.

The Approach The data have underlying factors. The researcher determines the number of factors. The factor loadings are obtained through the covariance (or correlation) matrix. The researcher interprets the factors based on the loadings. A code sketch of the extraction step follows.
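As a rough sketch of the extraction step in Python with NumPy, using the correlation matrix above. This uses a simple principal-component style extraction (eigenvectors of R scaled by the square roots of their eigenvalues); the slides' loadings presumably come from a proper factor-analysis fit, so the numbers below will be close but not identical:

```python
import numpy as np

# Correlation matrix of the six subject scores, typed in from the slide.
R = np.array([
    [1.000, 0.439, 0.410, 0.288, 0.329, 0.248],
    [0.439, 1.000, 0.351, 0.354, 0.320, 0.329],
    [0.410, 0.351, 1.000, 0.164, 0.190, 0.181],
    [0.288, 0.354, 0.164, 1.000, 0.595, 0.470],
    [0.329, 0.320, 0.190, 0.595, 1.000, 0.464],
    [0.248, 0.329, 0.181, 0.470, 0.464, 1.000],
])
subjects = ["Latin", "English", "History", "Arithmetic", "Algebra", "Geometry"]

# Eigendecomposition of R, eigenvalues sorted largest first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Two-factor loadings: eigenvectors scaled by sqrt(eigenvalue).
loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])
communalities = (loadings ** 2).sum(axis=1)   # variance explained per variable
for subj, (f1, f2), h2 in zip(subjects, loadings, communalities):
    print(f"{subj:<10} F1={f1: .3f}  F2={f2: .3f}  h2={h2:.3f}")
```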

Factor Loadings for the Subject Scores

Variable     F1      F2     Communality
Latin       .553    .429    .490
English     .568    .288    .406
History     .392    .450    .356
Arithmetic  .740   -.273    .623
Algebra     .724   -.211    .569
Geometry    .595   -.132    .372

The factor loadings do not give us any immediately identifiable groups or factor interpretation. Or do they? Communalities measure how much of a variable's variance is explained by the factor structure.

Factor Loadings Plot Figure: Plot of factor loadings with two factors for the scores example

The Factor Rotation The factors are not immediately identifiable. What do we do now? The factor structure, in terms of variance explained, remains unchanged if we rotate the factors. Let's rotate and see if the factor loadings become interpretable.
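The slides do not name the rotation method; varimax is the usual default, so here is a textbook varimax implementation as a hedged sketch, which can rotate the loadings from the previous snippet:

```python
import numpy as np

def varimax(L, gamma=1.0, max_iter=100, tol=1e-6):
    """Varimax rotation of a (p x k) loading matrix L.

    Iteratively finds an orthogonal rotation that drives each variable's
    loadings toward a single factor, leaving communalities unchanged.
    """
    p, k = L.shape
    R = np.eye(k)
    d = 0.0
    for _ in range(max_iter):
        Lr = L @ R
        u, s, vt = np.linalg.svd(
            L.T @ (Lr ** 3 - (gamma / p) * Lr @ np.diag((Lr ** 2).sum(axis=0)))
        )
        R = u @ vt                   # updated orthogonal rotation
        d_new = s.sum()
        if d_new < d * (1 + tol):    # criterion stopped improving
            break
        d = d_new
    return L @ R

# Usage: rotated = varimax(loadings), with `loadings` from the earlier sketch.
```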

Rotated Factor Loadings for the Subject Scores

Variable     F1      F2     Communality
Latin       .369    .594    .490
English     .433    .467    .406
History     .211    .558    .356
Arithmetic  .789   -.001    .623
Algebra     .752   -.054    .569
Geometry    .604   -.083    .372

Rotation makes the two factors immediately identifiable: F1 loads on Arithmetic, Algebra, and Geometry (the science factor), while F2 loads on Latin, English, and History (the liberal arts factor).

Rotated Factor Loadings Plot Figure: Plot of factor loadings with two factors for the scores example

Approach - Summary Decide on the number of factors. Obtain factor loadings for the variables. Interpret the factors. If the interpretation is not obvious, rotate the factors and check the loadings again.

Psychometrics, psychology, and human factors: identify factors that explain a variety of results on different tests. Marketing: identify the salient attributes consumers use to evaluate products in a category. Physical sciences: geochemistry, ecology, and hydrochemistry.

Differences Principal components capture most of the variability in the data using fewer dimensions than those in which the data exist; hence the principal components lie in the same space as the data. Factor analysis conceptually searches for underlying but unobserved factors that explain the correlation in the data; hence the factors lie in a different space than the data.

Reference: Richard Johnson, Dean Wichern, Applied Multivariate Statistical Analysis, 5th edition.