Multivariate Data Analysis: a survey of data reduction and data association techniques. Principal Components Analysis


Data reduction approaches:
  Cluster analysis
  Principal components analysis
  Principal coordinates analysis
  Multidimensional scaling

Hypothesis testing approaches:
  Discriminant analysis
  MANOVA
  ANOSIM
  Canonical correlation
  PERMANOVA

Objects: the things we wish to compare, i.e. sampling or experimental units (e.g. quadrats, animals, plants, cages).
Variables: characteristics measured on each object, usually continuous (e.g. counts of species, sizes of body parts).

Ecological data
Objects: sampling units (SUs, e.g. quadrats, plots)
Variables: species abundances and/or environmental data
Common in community ecology.

Example: Wisconsin forests (Peet & Loucks 1977). Plots (quadrats) in Wisconsin forests; the number of individuals of each tree species was recorded in each quadrat.
Objects: quadrats
Variables: abundances of each tree species

Data:

Plot  Bur oak  Black oak  White oak  Red oak  ...
1     9        8          5          3
2     8        9          4          4
3     3        8          9          0
4     5        7          9          6
5     6        0          7          9
6     0        0          7          8

Example: Garroch Head dumping ground (Clarke & Ainsworth 1993). Sewage sludge dumping ground in a bay; a transect was run across the dumping ground, with a core of mud taken at each of 10 stations along it.
Objects: stations
Variables: metal concentrations in ppm

Data:

Station  Cu   Mn    Co  Ni  Zn   Cd   ...
1        26   2470  14  34  160  0
2        30   1170  15  32  156  0.2
3        37   394   12  38  182  0.2
4        74   349   12  41  227  0.5
5        115  317   10  37  329  2.2

Morphological data
Objects: usually organisms or specimens
Variables: morphological measurements

Example: morphological variation between dog species/types
Objects: dog types (7)
Variables: sizes of 6 different parts of the mandible (mandible breadth, mandible height, etc.)

Data (variables 1-6):

Dog type         1     2     3     4     5     6
Modern dog       9.7   21.0  19.4  7.7   32.0  36.5
Jackal           8.1   16.7  18.3  7.0   30.3  32.9
Chinese wolf     13.5  27.3  26.8  10.6  41.9  48.1
Indian wolf      11.5  24.3  24.5  9.3   40.0  44.6
Cuon             10.7  23.5  21.4  8.5   28.8  37.6
Dingo            9.6   22.6  21.1  8.3   34.4  43.1
Prehistoric dog  10.3  22.1  19.1  8.1   32.3  25.0
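As a sketch, the mandible table can be held as an objects x variables matrix, the standard input shape for the multivariate methods that follow (NumPy assumed; values copied from the table):

```python
import numpy as np

# Dog mandible data: 7 dog types (objects) x 6 mandible measurements (variables).
dogs = np.array([
    [ 9.7, 21.0, 19.4,  7.7, 32.0, 36.5],  # Modern dog
    [ 8.1, 16.7, 18.3,  7.0, 30.3, 32.9],  # Jackal
    [13.5, 27.3, 26.8, 10.6, 41.9, 48.1],  # Chinese wolf
    [11.5, 24.3, 24.5,  9.3, 40.0, 44.6],  # Indian wolf
    [10.7, 23.5, 21.4,  8.5, 28.8, 37.6],  # Cuon
    [ 9.6, 22.6, 21.1,  8.3, 34.4, 43.1],  # Dingo
    [10.3, 22.1, 19.1,  8.1, 32.3, 25.0],  # Prehistoric dog
])
print(dogs.shape)  # (7, 6)
```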

Presentation of Multivariate Data
It is hard to visualize complex (more than 3-dimensional) multivariate datasets. For example, how do you visualize 7 attributes of a dog skull? It is easier to visualize relationships between objects (e.g. similarity, dissimilarity, correlation, scaled distance).

Both ordination and classification start from the raw data matrix (objects O1..Op by variables V1..Vn), converted to a resemblance matrix (objects by objects) created using correlations, covariances or dissimilarity indices.

[Diagram: raw data matrix (objects x variables) -> resemblance matrix (objects x objects)]

Principal Components Analysis
Aims to reduce a large number of variables to a smaller number of summary variables, called Principal Components (or factors), that explain most of the variation in the data.
Is basically a rotation of axes after centering to the means of the variables; the rotated axes are the Principal Components.
Is usually carried out using a matrix algebra technique called eigenanalysis.

Contrast with regression: ordinary least squares (OLS) estimation gives the best prediction of Y from X by minimizing the distance from each point to the line in the y direction only, i.e. the residuals between observed y_i and predicted y-hat_i.

[Diagram: least squares regression line of Y on X, with residuals shown vertically]
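The centering-and-rotation view can be sketched in a few lines, assuming NumPy and illustrative random data (not from the slides): center the variables, eigen-decompose the covariance matrix, and project the objects onto the eigenvectors, which are the rotated axes.

```python
import numpy as np

# PCA via eigenanalysis on illustrative data.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # 50 objects x 3 variables
Xc = X - X.mean(axis=0)                 # center on the variable means
cov = np.cov(Xc, rowvar=False)          # 3 x 3 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenanalysis (ascending order)
order = np.argsort(eigvals)[::-1]       # sort components by variance explained
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Xc @ eigvecs                   # object scores on the rotated axes
# The rotation preserves total variance: sum of eigenvalues = trace of cov.
print(np.isclose(eigvals.sum(), np.trace(cov)))  # True
```

Because the new axes are orthogonal, the component scores are uncorrelated with each other, which is what makes the leading components useful summaries.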

PCA instead describes association among variables: it minimizes the distance from each point to the line in both the x and y directions.

[Diagram: comparison of Component 1 (Factor 1) with the regression line of Y on X through the same data]

PCA - association among variables (minimize distance to the line in both x and y directions). This can be done in N dimensions. The maximum number of PCs equals the number of original variables (and at most one less than the number of objects).

[Diagram: PC1 and PC2 axes through a cloud of points]

Steps in PCA
1) From the raw data matrix, calculate the correlation matrix, or the covariance matrix of standardized variables. Example for three soil nitrogen variables measured at several sites (NO3, total organic N (TON), total N (TN)):

       NO3   TON   TN
  NO3  1
  TON  0.37  1
  TN   0.84  0.13  1

2) Calculate the eigenvectors (the weightings of each original variable on each component) and the eigenvalues (= "latent roots", relative measures of the variation explained by each component).
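Step 1 can be sketched as follows, assuming NumPy and illustrative random data rather than the slide's soil values; it also shows why the two phrasings above are equivalent: the correlation matrix is the covariance matrix of the standardized variables.

```python
import numpy as np

# Step 1 sketch: correlation matrix from a raw data matrix.
rng = np.random.default_rng(1)
X = rng.normal(size=(30, 3))                       # 30 sites x 3 variables
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize each variable
R = (Z.T @ Z) / (len(Z) - 1)                       # covariance of standardized data
print(np.allclose(R, np.corrcoef(X, rowvar=False)))  # True
```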

Eigenvalues and eigenvectors. The score of each object on each component is a linear combination of the original variables:

  z_ik = c_1*y_i1 + c_2*y_i2 + ... + c_j*y_ij + ... + c_p*y_ip

where
  z_ik = score for component k for object i
  y_ij = value of original variable j for object i
  c_j = factor score coefficient (weight) of variable j for component k

Example: soil chemistry in a forest, where the objects are sampling sites and the variables are chemical measurements (e.g. total N):
  z_ik = c_1(NO3) + c_2(total organic N) + c_3(total N) + ...

Steps in PCA - continued
3) Decide how many components to retain, using a scree plot of eigenvalues against component number. An eigenvalue of 1 means the factor explains as much variation in the dataset as one original variable; values greater than 1 indicate useful factors.

[Scree plot: eigenvalue (0 to 5) against factor number (1 to 8)]
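Steps 2 and 3 can be sketched together, assuming NumPy and illustrative data (not the slide's soil values): eigenanalysis of the correlation matrix gives the weights and eigenvalues, the scores come from projecting the standardized data onto the eigenvectors, and the eigenvalue-greater-than-1 rule counts the useful factors.

```python
import numpy as np

# Steps 2-3 sketch: eigenanalysis of R, component scores, retention rule.
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 5))
X[:, 1] += X[:, 0]                      # induce some correlation worth reducing
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]  # descending by variance
Z = (X - X.mean(0)) / X.std(0, ddof=1)
scores = Z @ eigvecs                    # z_ik = sum_j c_j * y_ij for each object
retained = int((eigvals > 1).sum())     # eigenvalue > 1 rule
```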

Steps in PCA
4) Using the factor score coefficients, calculate the factor scores: score = coefficient x (standardized) variable, summed over variables.
5) Position the objects on a scatterplot, using their factor scores on the first two (or three) Principal Components.

[Scatterplot: FACTOR(2) against FACTOR(1), both roughly -3 to 3, with Site 1, Site 2 and Site 3 plotted]

What are loadings? The correlations between the original variables and the factors: for example, the correlation between variable X and Factor 1. Correlations range from +1 to -1: +1 indicates a strong positive relationship with no scatter around the line; -1 indicates a strong negative relationship with no scatter around the line.

Interpretation of r (correlation coefficient):
  r = 1,     r^2 = 1
  r = 0.77,  r^2 = 0.59
  r = 0,     r^2 = 0
  r = -1,    r^2 = 1
  r = -0.77, r^2 = 0.59

[Plots: original variable against Factor 1 for each value of r]
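The loadings-as-correlations definition can be checked numerically, assuming NumPy and illustrative data: correlating each standardized variable with each component's scores reproduces the familiar shortcut that, for a correlation-matrix PCA, the loading equals the eigenvector weight times the square root of the eigenvalue.

```python
import numpy as np

# Loadings sketch: correlation of each variable with each component's scores.
rng = np.random.default_rng(3)
X = rng.normal(size=(60, 4))
Z = (X - X.mean(0)) / X.std(0, ddof=1)
R = np.corrcoef(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(R)
eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
scores = Z @ eigvecs
loadings = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                      for k in range(4)] for j in range(4)])
# Shortcut: loading_jk = eigenvector weight c_jk * sqrt(eigenvalue_k)
print(np.allclose(loadings, eigvecs * np.sqrt(eigvals)))  # True
```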

Worked example using the ourworld dataset
Variables sampled: population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, and death rate in 1982 (7 in total). Can these variables be reduced to fewer composite factors?

Multiply the raw data by the coefficients to get factor scores.

Raw data:
Case  POP83  POP86  POP90     Birth82  Death82  GNP   Mil
1     3.4    3.6    3.500212  20       9        5150  95.83333
2     7.5    7.6    7.644275  12       12       9880  127.2368

Factor coefficients:
Case 1, Factor 1 = 3.4(.560) + 3.6(.564) + 3.5(.566) + 20(.114) + 9(.086) + 5150(-.130) + 95.83(-.092)
Case 1, Factor 2 = 3.4(.141) + 3.6(.123) + 3.5(.104) + 20(-.520) + 9(-.326) + 5150(.574) + 95.83(.495)
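The arithmetic above is just a dot product of the case's values with each factor's coefficient vector; a sketch using the numbers from the slide (NumPy assumed; note the slide applies the coefficients to raw rather than standardized values):

```python
import numpy as np

# Case 1 values and the two sets of factor coefficients from the slide.
case1   = np.array([3.4, 3.6, 3.500212, 20, 9, 5150, 95.83333])
f1_coef = np.array([.560, .564, .566, .114, .086, -.130, -.092])
f2_coef = np.array([.141, .123, .104, -.520, -.326, .574, .495])
score_f1 = case1 @ f1_coef  # Case 1's score on Factor 1
score_f2 = case1 @ f2_coef  # Case 1's score on Factor 2
```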

Determine how many components (composite factors) to retain: ~80% of the variance is explained by 2 of the 7 components.

Using PCA: run a simple PCA with no rotation, then examine the loadings, i.e. the correlations between the factors and the original variables.
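The "~80% of variance" retention decision can be sketched as a cumulative sum of eigenvalues, assuming NumPy and illustrative random data (not the ourworld values):

```python
import numpy as np

# Retention sketch: cumulative proportion of variance explained.
rng = np.random.default_rng(4)
X = rng.normal(size=(50, 7))                 # 50 cases x 7 variables
X[:, 1:3] += X[:, :1]                        # correlated block, so few PCs dominate
eigvals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
cum = np.cumsum(eigvals) / eigvals.sum()     # cumulative variance proportions
k80 = int(np.searchsorted(cum, 0.80)) + 1    # components needed to reach >= 80%
```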

Rotation - Varimax

PCA on ourworld: what have we found out? The seven examined variables can be reduced to 2 and still retain ~80% of the original information. What have we not found out? Any relationships with predictor variables. Remember: PCA is a data reduction technique, NOT a hypothesis testing technique.

Can it be used to examine hypotheses? Overlay predictor groups on the factor plots: for example, is there a relationship between the factor scores and URBAN (city, rural) or GROUP (Europe, Islamic or New World)?

[Scatterplots: FACTOR(2) against FACTOR(1), points coded by GROUP (Europe, Islamic, New World) and by URBAN (city, rural). Any contribution of Factor 1?]