Multivariate Data Analysis: a survey of data reduction and data association techniques

Principal Components Analysis

Data reduction approaches:
- Cluster analysis
- Principal components analysis
- Principal coordinates analysis
- Multidimensional scaling

Hypothesis testing approaches:
- Discriminant analysis
- MANOVA
- ANOSIM
- Canonical correlation
- PERMANOVA
Objects: the things we wish to compare; sampling or experimental units, e.g. quadrats, animals, plants, cages.
Variables: characteristics measured on each object, usually continuous variables, e.g. counts of species, sizes of body parts.
Ecological data

Objects: sampling units (SUs, e.g. quadrats, plots)
Variables: species abundances and/or environmental data
Common in community ecology.

Example: Wisconsin forests (Peet & Loucks 1977)
- Plots (quadrats) in Wisconsin forests
- Number of individuals of each tree species recorded in each quadrat
- Objects: quadrats
- Variables: abundances of each tree species
Data

Plot  Bur oak  Black oak  White oak  Red oak  ...
 1       9        8          5         3
 2       8        9          4         4
 3       3        8          9         0
 4       5        7          9         6
 5       6        0          7         9
 6       0        0          7         8
...

Example: Garroch Head dumping ground (Clarke & Ainsworth 1993)
- Sewage sludge dumping ground in a bay
- Transect across the dumping ground
- Core of mud taken at each of 10 stations along the transect
- Objects: stations
- Variables: metal concentrations (ppm)
Data

Station   Cu    Mn    Co   Ni   Zn   Cd   ...
  1       26   2470   14   34  160   0
  2       30   1170   15   32  156   0.2
  3       37    394   12   38  182   0.2
  4       74    349   12   41  227   0.5
  5      115    317   10   37  329   2.2
...

Morphological data
Objects: usually organisms or specimens
Variables: morphological measurements
Morphological data

Example: morphological variation among dog species/types
- Objects: dog types (7)
- Variables: sizes of 6 different parts of the mandible (mandible breadth, mandible height, etc.)

Data

                     Variable
Dog type           1     2     3     4     5     6
Modern dog        9.7  21.0  19.4   7.7  32.0  36.5
Jackal            8.1  16.7  18.3   7.0  30.3  32.9
Chinese wolf     13.5  27.3  26.8  10.6  41.9  48.1
Indian wolf      11.5  24.3  24.5   9.3  40.0  44.6
Cuon             10.7  23.5  21.4   8.5  28.8  37.6
Dingo             9.6  22.6  21.1   8.3  34.4  43.1
Prehistoric dog  10.3  22.1  19.1   8.1  32.3  25.0
Presentation of Multivariate Data

Complex multivariate datasets (more than 3 dimensions) are hard to visualize. For example, how do you visualize 7 attributes of a dog skull? It is easier to visualize relationships between objects (e.g. similarity, dissimilarity, correlation, scaled distance).

The raw data matrix (objects x variables) is converted, using correlations, covariances or dissimilarity indices, into a resemblance matrix (objects x objects), which is then used for ordination or classification.
[Figure: raw data matrix (objects O1..Op by variables V1..Vn) transformed into an Op x Op resemblance matrix, feeding ordination and classification.]
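The raw-data-to-resemblance step can be sketched in a few lines. This is an illustrative NumPy sketch using Euclidean distance and the first three Wisconsin plots from the table above; many other dissimilarity indices (e.g. Bray-Curtis for abundance data) are used in practice.

```python
import numpy as np

# Raw data matrix: rows = objects (quadrats), columns = variables
# (species abundances).  First three Wisconsin plots from the table.
X = np.array([
    [9.0, 8.0, 5.0, 3.0],
    [8.0, 9.0, 4.0, 4.0],
    [3.0, 8.0, 9.0, 0.0],
])

# Euclidean distance between every pair of objects gives a symmetric
# objects x objects resemblance (dissimilarity) matrix.
diff = X[:, None, :] - X[None, :, :]
D = np.sqrt((diff ** 2).sum(axis=-1))

print(np.round(D, 2))
```

The matrix D has zeros on the diagonal (each object is identical to itself) and is symmetric, which is what ordination and classification methods expect as input.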
Principal Components Analysis

PCA aims to reduce a large number of variables to a smaller number of summary variables, called Principal Components (or factors), that explain most of the variation in the data. It is essentially a rotation of axes after centering to the means of the variables; the rotated axes are the Principal Components. It is usually carried out with a matrix-algebra technique called eigenanalysis.

Regression

Ordinary least squares (OLS) estimation gives the best prediction of Y given X by minimizing the distance, in the y direction only, from each observed point to the line.
[Figure: least-squares regression line, showing an observed y, the predicted y on the line, and the residual between them.]
PCA measures association among variables: it minimizes the distance from each point to the line in both the x and y directions (i.e. perpendicular to the line), rather than in the y direction only as in regression.
[Figure: comparison of the Y-on-X regression line with Principal Component 1 (Factor 1) through the same scatter; PC2 lies perpendicular to PC1.]

This can be done in N dimensions. With p original variables the analysis yields p components, although at most min(p, n - 1) of them (for n objects) have non-zero variance.
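The contrast between the two fitting criteria can be checked numerically. A minimal sketch on synthetic two-variable data (invented for illustration): the OLS slope minimizes vertical distances only, while PC1, the leading eigenvector of the covariance matrix, minimizes perpendicular distances, so the two lines differ whenever there is scatter.

```python
import numpy as np

# Synthetic data: y depends on x with noise (illustrative only)
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)
X = np.column_stack([x, y])
Xc = X - X.mean(axis=0)

# OLS slope: minimizes vertical (y-direction) distances only
ols_slope = (Xc[:, 0] @ Xc[:, 1]) / (Xc[:, 0] @ Xc[:, 0])

# PC1: leading eigenvector of the covariance matrix; it minimizes
# perpendicular distances to the line (total least squares)
cov = np.cov(Xc.T)
vals, vecs = np.linalg.eigh(cov)
pc1 = vecs[:, np.argmax(vals)]
pc1_slope = pc1[1] / pc1[0]

print(ols_slope, pc1_slope)
```

With noisy positively correlated data the PC1 line is always at least as steep as the OLS line, because OLS attenuates the slope toward zero while PC1 treats both axes symmetrically.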
Steps in PCA

1) From the raw data matrix, calculate the correlation matrix (or the covariance matrix of the standardized variables).

Example: NO3, total organic N (TON) and total N (TN) measured at Sites 1, 2, 3, ...

        NO3   TON   TN
NO3     1
TON     0.37  1
TN      0.84  0.13  1

2) Calculate eigenvectors (the weightings of each original variable on each component) and eigenvalues ("latent roots", relative measures of the variation explained by each component).
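Step 1 as code, using a hypothetical set of soil-chemistry measurements (the values below are invented for illustration):

```python
import numpy as np

# Hypothetical measurements at 6 sites:
# columns are NO3, total organic N (TON), total N (TN)
X = np.array([
    [1.2, 4.1, 5.5],
    [0.8, 3.6, 4.6],
    [1.5, 4.9, 6.2],
    [0.6, 4.4, 5.1],
    [1.1, 3.2, 4.5],
    [0.9, 5.0, 6.0],
])

# np.corrcoef expects variables in rows, hence the transpose.
R = np.corrcoef(X.T)
print(np.round(R, 2))
```

The result is a symmetric variables x variables matrix with 1s on the diagonal, like the NO3/TON/TN triangle shown above.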
Eigenvalues and eigenvectors

The score of object i on component k is a weighted sum of the original variables:

z_ik = c_k1 y_i1 + c_k2 y_i2 + ... + c_kj y_ij + ... + c_kp y_ip

where
  z_ik = score on component k for object i
  y_ij = value of original variable j for object i
  c_kj = factor score coefficient (weight) of variable j on component k

Example: soil chemistry in a forest, where the objects are sampling sites and the variables are chemical measurements (e.g. total N):

z_ik = c_k1 (NO3) + c_k2 (total organic N) + c_k3 (total N) + ...

Steps in PCA (continued)

3) Decide how many components to retain (scree plot of eigenvalues).
[Figure: scree plot of eigenvalue against factor number for 8 factors.]

An eigenvalue of 1 means the factor explains as much variation in the dataset as one original variable; values greater than 1 indicate useful factors.
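Step 2 and step 3 as a sketch, using the NO3/TON/TN correlation matrix quoted above. The eigenvalue > 1 rule (the Kaiser criterion) follows because the eigenvalues of a correlation matrix sum to the number of variables:

```python
import numpy as np

# Correlation matrix from the NO3 / TON / TN example above
R = np.array([
    [1.00, 0.37, 0.84],
    [0.37, 1.00, 0.13],
    [0.84, 0.13, 1.00],
])

# eigh is appropriate for symmetric matrices; sort largest first.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Eigenvalues of a correlation matrix sum to the number of variables,
# so an eigenvalue > 1 means the component explains more variance
# than one original variable (Kaiser criterion).
print(np.round(eigvals, 3))
retained = (eigvals > 1).sum()
print(retained, "component(s) retained")
```

Here only the first component exceeds 1, which is what a scree plot of these three eigenvalues would show as the obvious elbow.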
Steps in PCA (continued)

4) Using the factor score coefficients, calculate each factor score: score = sum of coefficient x (standardized) variable.

5) Position the objects on a scatterplot, using their factor scores on the first two (or three) Principal Components.
[Figure: Sites 1-3 plotted on FACTOR(1) vs FACTOR(2).]
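Steps 4 and 5 as a sketch, reusing the invented soil-chemistry values from the earlier illustration: standardize the variables, eigenanalyse the correlation matrix, and project each object onto the first two components. The resulting score pairs are the plotting coordinates.

```python
import numpy as np

# Illustrative data: rows = sites, columns = variables (invented)
X = np.array([
    [1.2, 4.1, 5.5],
    [0.8, 3.6, 4.6],
    [1.5, 4.9, 6.2],
    [0.6, 4.4, 5.1],
    [1.1, 3.2, 4.5],
    [0.9, 5.0, 6.0],
])

# Standardize each variable, eigenanalyse the correlation matrix,
# then project: score = sum over variables of coefficient * z-value.
Z = (X - X.mean(axis=0)) / X.std(axis=0)
R = np.corrcoef(Z.T)
vals, vecs = np.linalg.eigh(R)
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

scores = Z @ vecs[:, :2]   # position of each site on PC1 and PC2
print(np.round(scores, 2))
```

The scores are centered on zero, and the variance of the PC1 scores equals the first eigenvalue, confirming that PC1 captures the largest share of the variation.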
What are loadings?

Loadings are the correlations (r's) between the original variables and the factors: for example, the correlation between variable X and Factor 1. Correlations range from +1 to -1:
- +1 indicates a strong positive relationship with no scatter around the line
- -1 indicates a strong negative relationship with no scatter around the line

Interpretation of r (correlation coefficient):
  r = 1,    r^2 = 1
  r = .77,  r^2 = .59
  r = 0,    r^2 = 0
  r = -1,   r^2 = 1
  r = -.77, r^2 = .59
[Figure: scatterplots of each original variable against Factor 1 for each value of r.]
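The definition can be verified directly: correlating each (standardized) variable with each factor score reproduces the loadings, which for PCA on a correlation matrix also equal eigenvector times the square root of the eigenvalue. A sketch on the same invented data as before:

```python
import numpy as np

# Illustrative data: rows = objects, columns = variables (invented)
X = np.array([
    [1.2, 4.1, 5.5],
    [0.8, 3.6, 4.6],
    [1.5, 4.9, 6.2],
    [0.6, 4.4, 5.1],
    [1.1, 3.2, 4.5],
    [0.9, 5.0, 6.0],
])

Z = (X - X.mean(axis=0)) / X.std(axis=0)
vals, vecs = np.linalg.eigh(np.corrcoef(Z.T))
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]
scores = Z @ vecs

# Loading of variable j on factor k = correlation between column j of
# the data and column k of the scores (first two factors shown).
loadings = np.array([[np.corrcoef(Z[:, j], scores[:, k])[0, 1]
                      for k in range(2)]
                     for j in range(Z.shape[1])])

print(np.round(loadings, 2))
```

Because loadings are correlations, every entry lies between -1 and +1, which is what makes them directly interpretable.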
Worked example: ourworld

Variables sampled: population in 1983, 1986 and 1990, military spending, Gross National Product, birth rate in 1982, death rate in 1982 (7 in total). Can these variables be reduced to fewer composite factors?

Multiply the data by the factor score coefficients to get the factor scores (in practice the variables are standardized first; raw values are shown here for illustration).

Raw data:

Case  POP83  POP86  POP90     Birth82  Death82  GNP    Mil
 1     3.4    3.6   3.500212    20        9     5150    95.83333
 2     7.5    7.6   7.644275    12       12     9880   127.2368

Factor coefficients:

Case 1, Factor 1 = 3.4(.560) + 3.6(.564) + 3.5(.566) + 20(.114) + 9(.086) + 5150(-.130) + 95.83(-.092)
Case 1, Factor 2 = 3.4(.141) + 3.6(.123) + 3.5(.104) + 20(-.520) + 9(-.326) + 5150(.574) + 95.83(.495)
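The same computation as code, using the two cases and the coefficients quoted above. Following step 4, the variables are standardized before the coefficients are applied (otherwise a large-valued variable such as GNP would dominate the weighted sum); with only two cases the z-scores reduce to mirror-image rows.

```python
import numpy as np

# Raw data for the first two cases (from the slide):
# POP83, POP86, POP90, Birth82, Death82, GNP, Mil
raw = np.array([
    [3.4, 3.6, 3.500212, 20,  9, 5150,  95.83333],
    [7.5, 7.6, 7.644275, 12, 12, 9880, 127.2368],
])

# Factor score coefficients quoted on the slide (Factor 1, Factor 2)
coef = np.array([
    [ .560,  .141],
    [ .564,  .123],
    [ .566,  .104],
    [ .114, -.520],
    [ .086, -.326],
    [-.130,  .574],
    [-.092,  .495],
])

# Standardize (mean 0, sd 1 per variable), then apply the weights:
# each row of `scores` holds Factor 1 and Factor 2 for one case.
z = (raw - raw.mean(axis=0)) / raw.std(axis=0)
scores = z @ coef
print(np.round(scores, 3))
```

With two cases the standardized rows are exact negatives of each other, so the two cases land at mirror-image positions on the factor plot.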
Determine how many components (composite factors) to retain: here, ~80% of the variance is explained by 2 of the 7 components.

Using PCA
- Run a simple PCA, no rotation
- Examine the loadings: correlations between the factors and the original variables
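The "~80% by 2 components" figure comes from the cumulative proportion of the eigenvalues. A sketch with hypothetical eigenvalues (invented, but chosen to match the ourworld outcome; they sum to 7 because a correlation matrix of 7 variables was used):

```python
import numpy as np

# Hypothetical eigenvalues for a 7-variable PCA (invented values;
# they sum to 7, the number of variables, for a correlation matrix)
eigvals = np.array([3.9, 1.7, 0.6, 0.4, 0.2, 0.1, 0.1])

# Proportion of variance per component and its running total;
# the cumulative proportion after two components is 0.8 here.
prop = eigvals / eigvals.sum()
cumulative = np.cumsum(prop)
print(np.round(cumulative, 2))
```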
Rotation: Varimax

PCA - ourworld

What have we found out? The seven examined variables can be reduced to 2 and still retain ~80% of the original information.

What have we not found out? Any relationships with predictor variables. Remember: PCA is a data reduction technique, NOT a hypothesis testing technique.

Can it be used to examine hypotheses? Overlay predictor groups on the factor plots. For example, is there a relationship between the factor scores and Urban (city, rural) or Group (Europe, Islamic or New World)?
Any contribution of Factor 1?
[Figure: two scatterplots of FACTOR(1) vs FACTOR(2), one coded by GROUP (Europe, Islamic, New World) and one by URBAN (city, rural).]