Data Analysis in Paleontology Using R. Data Manipulation


Data Analysis in Paleontology Using R
Session 3, 24 Jan 2006
Gene Hunt, Dept. of Paleobiology, NMNH, SI

Data Manipulation

Sorting
x <- c(3, 1, 10)
sort(x)   # sorts vector (does not replace x)
order(x)  # gives sorted indices
rank(x)   # gives ranks (averages ties)

Selecting
subset(data)               # subsets cases/variables of a dataframe
which(tf)                  # gives indices for which tf is TRUE
which(valve.length < 600)  # indices of small populations

Combining
rbind()        # combines rows of vectors/matrices
cbind()        # combines columns of vectors/matrices
merge(d1, d2)  # dataframe join (like databases)
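To make the difference between sort(), order(), and rank() concrete, here is a minimal sketch using only base R (the small data frame is a hypothetical example):

```r
x <- c(3, 1, 10)

sort(x)      # values in ascending order: 1 3 10
order(x)     # indices that would sort x: 2 1 3
x[order(x)]  # equivalent to sort(x)
rank(x)      # rank of each element in place: 2 1 3

# order() is the most general: it lets you sort one object by another,
# e.g. the rows of a data frame by one of its columns
d <- data.frame(sp = c("a", "b", "c"), n = x)
d[order(d$n), ]  # rows of d sorted by abundance
```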

Tabulating

table(f1)       # counts per unique value of f1
table(f1, f2)   # cross tabulation

data(mtcars)    # car data again
attach(mtcars)
table(cyl)
table(cyl, am)  # tabulate number of cylinders vs.
                # transmission type (am=1 is manual)

xtabs(formula)  # does the same with a formula interface
                # also useful if data are already cross-tabulated
                # (e.g., database query output)
                # has summary() and plot() methods

Manipulating Matrices

Often we want to do an operation on every row or column of a matrix:

apply(X, MARGIN, FUN, ...)
  # MARGIN=1 is by row; MARGIN=2 is by column
  # FUN is the function to be applied
  # ... are arguments for FUN (if needed)
  # returns a vector or matrix of results

apply(X, MARGIN=1, FUN=sum)  # row totals
apply(X, 2, sum)             # column totals
apply(X, 2, mean)            # column (variable) means
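A small worked example of apply() on a toy counts matrix (the site/species names are made up for illustration):

```r
# a small 2 x 3 counts matrix (hypothetical toy data)
m <- matrix(1:6, nrow = 2, byrow = TRUE,
            dimnames = list(c("site1", "site2"), c("spA", "spB", "spC")))
m
#       spA spB spC
# site1   1   2   3
# site2   4   5   6

apply(m, 1, sum)   # row totals:    site1 6, site2 15
apply(m, 2, sum)   # column totals: spA 5, spB 7, spC 9
apply(m, 2, mean)  # column means:  spA 2.5, spB 3.5, spC 4.5

# equivalent shortcuts in base R:
rowSums(m); colSums(m); colMeans(m)
```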

Manipulating Matrices

Can apply functions split by a factor:

tapply(x, INDEX, FUN, ...)
  # x is a vector
  # INDEX is the factor(s) to split by
  # FUN is the function to be applied
  # ... are arguments for FUN (if needed)

attach(cope)  # cope dataset
tapply(valve.length, INDEX=species, FUN=mean)
  # mean valve.length separately by species

See also: sapply(), lapply()

Looping

Sometimes we need to explicitly loop through a series of operations, usually with for():

for (i in 1:10) print(i)
for (i in c(5, 4, 237)) print(i)  # try this

for (i in 1:10) {
  # multiple expressions need to be grouped
  # within braces {}
}

nr <- nrow(X)             # number of rows in X
row.tot <- array(dim=nr)  # create vector
for (i in 1:nr) row.tot[i] <- sum(X[i,])  # same as apply(X, 1, sum)
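The cope dataset is not distributed with R, so here is the same tapply() idiom on hypothetical valve-length data, alongside the explicit loop it replaces:

```r
# toy data: valve lengths for two hypothetical species
valve.length <- c(500, 520, 610, 630, 650)
species <- factor(c("A", "A", "B", "B", "B"))

tapply(valve.length, INDEX = species, FUN = mean)
#   A   B
# 510 630

# the same result with an explicit loop over the factor levels
out <- numeric(nlevels(species))
names(out) <- levels(species)
for (lev in levels(species)) {
  out[lev] <- mean(valve.length[species == lev])
}
out
```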

Exercise 10. Data Manipulation & Looping

1. Use a loop to gather the data necessary to draw a rarefaction curve of the first sample (row) of the BCI dataset in the vegan package. (Hints: 1. Find out how many individuals are in this sample, then generate a sequence of sample sizes (ns) using seq(). 2. Create an array to hold the rarefied diversities (rrich) as they are computed. 3. Loop through 1:length(ns), calling rarefy() and saving the result to the appropriate element of rrich.)
2. Plot the resulting rarefaction curve, giving informative labels to the x and y axes.

What are the relationships among samples/variables?

Questions like:
- Which are most similar?
- Are there underlying groups or gradients?
- Do patterns relate to extrinsic variables?

Sometimes divided into two types:
- R-mode: among variables (columns)
- Q-mode: among samples (rows)

What are the relationships among samples/variables?

Problem: complicated, with too many variables, many of which are correlated
Solution: distill the signal into fewer axes that (hopefully) preserve the major features of the data structure

Usually relies on three steps:
1. Transform data (if desired)
2. Calculate a distance matrix
3. Reduce dimensionality using clustering or ordination of the distances

Data Transformation

Data matrix = X

1. Transform the whole matrix, e.g. log(X), sqrt(X), asin(X)

2. Transform matrix rows
   - can loop through rows using for(), or use apply()
   - special case: absolute to relative abundances
     prop.table(X, margin=1)  # X must be a matrix
                              # if not, use as.matrix(X)
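A quick sketch of the prop.table() special case, on toy count data:

```r
# toy counts matrix: two samples, three taxa (hypothetical data)
X <- matrix(c(10, 30, 60,
               5,  5, 10), nrow = 2, byrow = TRUE)

prop.table(X, margin = 1)  # each row rescaled to relative abundances
#      [,1] [,2] [,3]
# [1,] 0.10 0.30 0.60
# [2,] 0.25 0.25 0.50

rowSums(prop.table(X, margin = 1))  # both rows now sum to 1
```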

Data Transformation

Data matrix = X

3. Transform matrix columns
   - can loop through columns using for(), or use apply()
   - special case: mean-centering and standardization
     scale(X, center=TRUE)              # mean-centers
     scale(X, center=TRUE, scale=TRUE)  # also sets sd=1

decostand(X, method=) does most ecological transformations:
  method="total"  # converts to relative abundance
  method="pa"     # converts to presence-absence

Computing Distances

Functions
dist(X, method="euclidean")
vegdist(X, method="bray")  # in vegan package

Lots of different metrics are available (method argument). Euclidean is often used in morphometrics; Bray-Curtis is common (among lots of others) in ecology.

d_Bray = sum_i |x_i - y_i| / sum_i (x_i + y_i)

d_Euc = sqrt( sum_i (x_i - y_i)^2 )

where x and y are vectors (rows of the data matrix)
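The distance formulas can be checked by hand against the packaged functions; a minimal sketch, assuming the vegan package is installed:

```r
# verify the Bray-Curtis formula against vegdist() for two toy samples
library(vegan)

x <- c(10, 0, 4)
y <- c(6, 2, 0)

d.hand <- sum(abs(x - y)) / sum(x + y)  # (4 + 2 + 4) / 22
d.hand                                  # 0.4545...

vegdist(rbind(x, y), method = "bray")   # same value

# and the Euclidean analogue
sqrt(sum((x - y)^2))                    # same as dist(rbind(x, y))
```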

S. Wing's Data

Floral data from one site (BCR, ~71 Ma)
In situ preservation, 100 sites along 4 km of outcrop
Counts for 130 taxa; sedimentary/environmental variables

sw <- read.table(file="BCRDATA.txt", header=TRUE)

Columns       Type of data
1:130         Taxon abundances
131:137       Debris abundances
138:147       Sed/env variables

X <- sw[,1:130]        # floral data only
carb <- sw$eperc.carb  # % carbon

Check out the data
site.tot <- apply(X, 1, sum)  # site total abundances
tax.tot <- apply(X, 2, sum)   # taxon total abundances
range(site.tot)               # 100 to 753.5; not too bad
hist(log10(tax.tot), col="tan")

Transform data, compute distance matrix
Xt <- sqrt(X)  # sqrt transform reduces the effect of dominant taxa
Dbr <- vegdist(Xt, method="bray")

Cluster Analysis

Hierarchical, agglomerative methods are generally used:

w.cl <- hclust(Dbr, method="average")  # other methods available
w.cl <- hclust(Dbr, meth="aver")       # same as previous (abbreviated arguments)
plot(w.cl)

Properties of cluster analyses:
1. Patterns are reduced to one dimension
2. Short distances are preserved; longer distances are distorted
3. Yields hierarchical groupings whether they are present or not

Cophenetic distances (implied by the clustering):

dist.cl <- cophenetic(w.cl)  # returns a distance matrix
plot(Dbr, dist.cl, xlab="True Distance", ylab="Distance Implied by Clustering")
abline(0, 1, col="red")      # add the line y=x

There are other hierarchical and non-hierarchical clustering methods available; see kmeans(), the cluster package, and the mclust package.
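The BCR data are not distributed with R, so here is the same cluster-distortion check sketched on vegan's BCI dataset instead:

```r
# how faithfully does a dendrogram represent the original distances?
library(vegan)
data(BCI)

D <- vegdist(sqrt(BCI), method = "bray")
cl <- hclust(D, method = "average")

D.coph <- cophenetic(cl)  # distances implied by the dendrogram
cor(D, D.coph)            # cophenetic correlation; nearer 1 = less distortion

plot(D, D.coph, cex = 0.3,
     xlab = "True distance", ylab = "Distance implied by clustering")
abline(0, 1, col = "red")
```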

Principal Components Analysis (PCA)

Common for biometric data, less common in ecology
New axes are based on the directions of maximal variance (covariance or correlation)
PC axis directions are defined by the variable loadings (eigenvectors); their lengths are defined by the eigenvalues (variances)
PC scores are computed for the observations

Example eigenvectors for two variables:

       PC1    PC2
x1     0.68   0.73
x2     0.73  -0.68

PCA in R

prcomp(X, scale.=FALSE)
  # scale.=FALSE uses the covariance matrix
  # scale.=TRUE uses the correlation matrix

Returns:
  sdev      # square roots of the eigenvalues
  rotation  # eigenvector coefficients in columns
  x         # PC scores

w.pca <- prcomp(Xt, scale.=TRUE)  # use correlation matrix
biplot(w.pca)                     # biplot (scaled scores & loadings)
screeplot(w.pca, log="y")         # plot of eigenvalues
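A self-contained PCA sketch on R's built-in iris measurements, showing how the returned components relate to eigenvalues and explained variance:

```r
# minimal PCA example on built-in data
X <- iris[, 1:4]

pca <- prcomp(X, scale. = TRUE)  # use the correlation matrix

ev <- pca$sdev^2      # eigenvalues (variance on each PC axis)
ev / sum(ev)          # proportion of variance per axis
cumsum(ev) / sum(ev)  # cumulative proportion of variance

head(pca$x)           # PC scores for the first observations
pca$rotation          # loadings (eigenvector coefficients)
biplot(pca)
```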

Principal Coordinates Analysis (PCO)

Like PCA, based on eigenvectors
Instead of a covariance or correlation matrix, the eigendecomposition is performed on a Gower-transformed distance matrix (Gower, 1966)
Useful for mixtures of data types (interval, ordinal, categorical)
Sometimes called classical or metric multidimensional scaling
Common in morphospace studies

In R, the function is cmdscale():
w.pco <- cmdscale(Dbr, k=2)  # k is the number of PCO axes

Correspondence Analysis

Simultaneous ordination of rows (sites) and columns (taxa) that maximizes the correlation between the site and taxon ordinations. Preserves the χ² distance.
Sometimes called Reciprocal Averaging
Another variant constrains the CA using extrinsic (environmental) variables (CCA)

In R, the function is cca() from vegan:
w.ca <- cca(Xt)
* There is also a function of the same name in package ade4
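A PCO sketch on vegan's BCI data (standing in for the BCR dataset), showing how cmdscale() can also report the eigenvalues behind each axis:

```r
# classical (metric) MDS on a Bray-Curtis distance matrix
library(vegan)
data(BCI)

D <- vegdist(sqrt(BCI), method = "bray")
pco <- cmdscale(D, k = 2, eig = TRUE)  # eig=TRUE also returns eigenvalues

plot(pco$points, xlab = "PCO 1", ylab = "PCO 2")

# proportion of the total (positive) eigenvalue variation on each axis
pco$eig[1:2] / sum(pco$eig[pco$eig > 0])
```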

Ordinations in vegan

The plot method is a biplot (little control over options)
Can use scores(w.ca) to extract scores

More control:
fig <- ordiplot(w.ca, type="n")
points(fig, what="sites", pch=21, bg="grey", cex=1.5)
points(fig, what="species", pch=22, cex=2, bg="red", select=tax.tot>200)

Detrended Correspondence Analysis (DCA)

Modified CA, designed to correct the arch effect
Divides CA1 into segments, then centers and rescales within each segment
Thought to be good at detecting a gradient
In marine paleoecology, DCA1 is usually interpreted as depth

w.dca <- decorana(Xt)

Nonmetric Multidimensional Scaling (NMDS)

Seeks a reduced-dimensionality ordination that maximizes the rank-order correlation of inter-point distances (the number of dimensions is determined ahead of time)
Iterative algorithm; the measure of rank-order misfit is called stress
Considered robust: aberrant samples do not dominate the axes
Scaling of the axes is arbitrary (as are rotation and translation)

NMDS in R
w.mds <- isoMDS(Dbr, k=2)  # k is the number of axes
                           # returns $points, $stress
symbols(w.mds$points, circles=carb, inches=0.3)

See also: sammon() {MASS}, metaMDS() {vegan}

What do NMDS axes mean? We can compute the equivalent of loadings: the correlation coefficients between the new (MDS) axes and the original variables.
cor(Xt, w.mds$points)  # MDS "loadings"
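An NMDS sketch with MASS::isoMDS on vegan's BCI data (again standing in for the BCR dataset):

```r
library(MASS)
library(vegan)
data(BCI)

D <- vegdist(sqrt(BCI), method = "bray")
mds <- isoMDS(D, k = 2)  # returns $points (scores) and $stress

mds$stress  # rank-order misfit (percent); lower is better
plot(mds$points, xlab = "NMDS 1", ylab = "NMDS 2")

# "loadings": correlation of each original variable with the NMDS axes
head(cor(sqrt(BCI), mds$points))
```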

ANOSIM

Similar to ANOVA in that it tests for differences across a priori groups (factors). Based on a distance matrix: pairwise distances are converted to their ranks, which are divided into within-group and between-group distances. The test statistic, R, is based on the mean between-group rank minus the mean within-group rank. R can range from -1 to 1, but is almost always positive. Significance is assessed using a permutation test.

R = (r_B - r_W) / (N(N-1)/4)

where r_B and r_W are the mean between- and within-group ranks, and N is the number of samples.

ANOSIM in R
gg <- array(dim=100)              # grouping variable
gg[sw$edist > 3000] <- "farend"   # lateral position (sw$edist)
gg[sw$edist < 3000] <- "nearend"
aa <- anosim(Dbr, gg, perm=500)
plot(aa)

Note: one could also test the association between Dbr and pairwise geographic distance using mantel() {vegan}
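An ANOSIM sketch using vegan's dune data, which ship with a real grouping factor (Management); the BCR data used in the slides are not distributed with R:

```r
library(vegan)
data(dune)
data(dune.env)

D <- vegdist(dune, method = "bray")
aa <- anosim(D, dune.env$Management, permutations = 500)

aa$statistic  # R: near 0 = no group structure, near 1 = strong structure
aa$signif     # permutation p-value
plot(aa)      # within- vs. between-group rank distances
```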

Exercise 11. Cluster & Ordination

1. Compare the clustering of Dbr based on average linkage to that with method="single". Do the overall shapes of the dendrograms differ? Different linkage algorithms produce characteristically different tree shapes; in particular, note tree balance (symmetry).
2. Compute the eigenvalues of the PCA results, and save them to a variable ev. Use this variable to make your own scree plot using plot(). Use type="o" or "b" to get overlapping lines and symbols. Plot only the first 10 eigenvalues, and remember to log-scale the y-axis.
3. Using ev, compute the proportion of variance explained by the eigenvalues; save this to a variable pev. Use the function cumsum() to compute the cumulative proportion of variance explained, and call it cpev (check ?cumsum). Use cbind() to make a matrix of the first ten elements of ev, pev, and cpev.
4. Use the function scores() to extract the site scores from the DCA results. Use the function symbols() to plot the first two DCA axes, with circle size proportional to carb. Are the results consistent with the NMDS ordination?
5. Do the following: shep <- Shepard(Dbr, w.mds$points), and then plot shep (when there are lots of points, it sometimes helps to make them very small, as with cex=0.3). This is called a Shepard plot; it shows the true pairwise distances between points (on the x-axis) against the pairwise distances in the ordination space. In addition to the named components x and y (which are plotted), shep also has a named component yf: the monotonic function relating true and ordination distances that is used in minimizing stress. Add this to the plot, using lines(shep$x, shep$yf, col="red", type="S").
6. How similar are the site ordinations in CA and DCA? One way to see the effects of the detrending is to compute the correlation coefficients between the site ordinations.
Do this, using scores() to extract the site ordination information (note that scores() yields both site and species scores for CA, so be careful).

Answers to Exercises [2]

Exercise 7. Factors, types, and missing data
1. read.table(file="data.txt", header=TRUE, na.strings="-999")
2. sum(age > 20)  # 17
3. median(x, na.rm=TRUE)

Exercise 8. Statistical models
1. # note that the order of arguments for plot() is reversed in model notation
2. plot(fitted(w), resid(w))  # note the nonrandom pattern
3. # Note that larger symbols are mostly above, and smaller symbols mostly below, the line.
4. w.td <- lm(valve.length ~ mg.temp + depth)  # both terms are significant; residuals look OK

Exercise 9. Biodiversity and Additional Exercises
1. x <- BCI[1,]; px <- x/sum(x)
2. Normally: H <- -sum(px*log(px))  # returns NaN because some of the px's are zero and log(0) is undefined.
   Instead: nz <- px > 0  # TRUE if px > 0
            H <- -sum(px[nz]*log(px[nz]))
3. rich <- specnumber(BCI); fa <- fisher.alpha(BCI); shan <- diversity(BCI); simp <- diversity(BCI, index="simp")
4. fa & rich, and simp & shan, are highly correlated
5. The F-test is highly significant, indicating that w.td significantly improves the fit

Answers to Exercises [3]

Exercise 10. Data Manipulation & Looping
1. ns <- seq(10, 480, 10); rrich <- array(dim=length(ns))
   for (i in 1:length(ns)) rrich[i] <- rarefy(BCI[1,], ns[i])
2. plot(ns, rrich, type="o", xlab="Sample size", ylab="Rarefied richness")

Exercise 11. Cluster and Ordination
1. plot(hclust(Dbr, meth="single"))  # single-linkage trees tend to be very imbalanced; average-linkage trees tend to be relatively balanced
2. ev <- w.pca$sdev^2; plot(ev[1:10], type="o", log="y")
3. pev <- ev/sum(ev); cpev <- cumsum(pev); cbind(ev[1:10], pev[1:10], cpev[1:10])
4. ds <- scores(w.dca); symbols(ds[,1], ds[,2], circles=carb, inches=0.3)  # carb is also correlated with the DCA ordination
5. plot(shep, pch=19, cex=0.3)  # note that rank order, not linear correlation, is what is preserved
6. ds <- scores(w.dca); cs <- scores(w.ca); cor(ds, cs$sites)  # the first axes are almost identical (except for sign, which is arbitrary), but none of the other axes are especially correlated