Lecture 5: Ecological distance metrics; Principal Coordinates Analysis

Univariate testing vs. community analysis

Univariate testing deals with hypotheses concerning individual taxa:
- Is this taxon differentially present or abundant in different samples?
- Is this taxon correlated with a given continuous variable?

What if we would like to draw conclusions about the community as a whole?

Useful ideas from modern statistics
- Distances between anything (abundances, presence-absence, graphs, trees)
- Direct hypotheses based on distances
- Decompositions through iterative structuration
- Projections
- Randomization tests, probabilistic simulations

Data → Distances → Statistics

[Figure: an abundance table (samples S-1 to S-16 by taxa OTU-1 to OTU-10,000) is converted into a matrix of pairwise distances between samples. The distance matrix feeds two analyses: visualization (principal coordinates analysis) and statistical hypothesis testing (PERMANOVA).]

What is a distance metric?

A scalar function $d(\cdot, \cdot)$ of two arguments such that:
- $d(x, y) \ge 0$: always nonnegative;
- $d(x, x) = 0$: distance to self is 0;
- $d(x, y) = d(y, x)$: distance is symmetric;
- $d(x, y) \le d(x, z) + d(z, y)$: triangle inequality.

What are some good distance metrics?
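The four axioms can be checked numerically on a finite set of points. The sketch below is illustrative only: the helper `is_metric` and the test points are hypothetical names introduced here, and passing the check on a few points does not prove that a function is a metric everywhere. It does show that the squared Euclidean distance violates the triangle inequality.

```python
import itertools
import math

def is_metric(d, points, tol=1e-12):
    """Check the four metric axioms for d on all pairs and triples of a finite point set."""
    for x in points:
        if abs(d(x, x)) > tol:                    # distance to self is 0
            return False
    for x, y in itertools.product(points, repeat=2):
        if d(x, y) < -tol:                        # nonnegativity
            return False
        if abs(d(x, y) - d(y, x)) > tol:          # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:     # triangle inequality
            return False
    return True

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

# Three collinear points: 0, 1, 2 on the real line.
points = [(0.0,), (1.0,), (2.0,)]
print(is_metric(euclidean, points))          # True
print(is_metric(squared_euclidean, points))  # False: d(0, 2) = 4 > d(0, 1) + d(1, 2) = 2
```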

Using distances to capture multidimensional heterogeneous information
- A good distance will enable us to analyze any type of data usefully.
- We can build specialized distances that incorporate different types of information (abundance, trees, geographical locations, etc.).
- We can visualize complex data as long as we know the distances between objects (observations, variables).
- We can compute distances (correlations) between distances to compare them.
- We can decompose the sources of variability contributing to distances in ANOVA-like fashion.

Distance and similarity
- Sometimes it is conceptually easier to talk about similarities rather than distances, e.g. sequence similarity.
- Any similarity measure S can be converted into a distance D (strictly, a dissimilarity: the triangle inequality is not guaranteed), e.g.:
  - if S is in (0, 1): $D = 1 - S$;
  - if S > 0: $D = 1/S$ or $D = \exp(-S)$.

A few useful distances and similarity indices

Distances:
- Euclidean (remember Pythagoras' theorem): $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$
- Weighted Euclidean: $\chi^2 = \sum_i (e_i - o_i)^2 / e_i$
- Hamming / L1: $\sum_i \mathbf{1}\{x_i \neq y_i\}$, the number of positions where x and y differ; Bray-Curtis is the L1 distance normalized by total abundance, $\sum_i |x_i - y_i| / \sum_i (x_i + y_i)$
- UniFrac (later)
- Jensen-Shannon: $(D(X \| M) + D(Y \| M))/2$, where $M = (X + Y)/2$
- Kullback-Leibler divergence: $D(X \| Y) = \sum_i x_i \ln(x_i / y_i)$

Similarity:
- Correlation coefficient
- Matching coefficient: $(f_{11} + f_{00}) / (f_{11} + f_{10} + f_{01} + f_{00})$
- Jaccard similarity index: $f_{11} / (f_{11} + f_{10} + f_{01})$

The counts $f_{11}, f_{10}, f_{01}, f_{00}$ come from the 2x2 table of the binary values of x and y:

  x \ y    1       0
  1        f_11    f_10
  0        f_01    f_00

UniFrac distance (Lozupone and Knight, 2005)
- Is a distance between groups of organisms related by a tree.
- Definition: the ratio of the sum of the lengths of the branches leading to sample X or Y, but not both, to the sum of all branch lengths of the tree.
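Several of the distances and similarity indices listed above fit in a few lines each. This is a minimal sketch using only the standard library: the function names and the example vectors are chosen here for illustration, and the Bray-Curtis form assumed is the usual normalized L1 distance for nonnegative abundances.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def bray_curtis(x, y):
    # L1 distance divided by total abundance; assumes nonnegative values.
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))

def kl_divergence(x, y):
    # D(X || Y) = sum_i x_i ln(x_i / y_i); terms with x_i = 0 contribute 0.
    # Requires y_i > 0 wherever x_i > 0.
    return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

def jensen_shannon(x, y):
    # x and y are probability vectors; M is their midpoint.
    m = [(a + b) / 2 for a, b in zip(x, y)]
    return (kl_divergence(x, m) + kl_divergence(y, m)) / 2

def binary_counts(x, y):
    # Returns (f11, f10, f01, f00) for two 0/1 vectors.
    f11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    f10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    f01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    f00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    return f11, f10, f01, f00

def matching(x, y):
    f11, f10, f01, f00 = binary_counts(x, y)
    return (f11 + f00) / (f11 + f10 + f01 + f00)

def jaccard(x, y):
    f11, f10, f01, _ = binary_counts(x, y)
    return f11 / (f11 + f10 + f01)

# Presence/absence of five taxa in two samples (made-up values).
x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]
print(matching(x, y), jaccard(x, y))  # 0.6 0.5
```

The matching coefficient counts shared absences (f_00) as agreement while Jaccard ignores them, which is why the two values differ for the same pair of samples.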

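Weighted UniFrac (Lozupone et al., 2007)

As defined by Lozupone et al. (2007), the raw weighted UniFrac distance between samples A and B is
$U = \sum_{i=1}^{n} b_i \left| \frac{A_i}{A_T} - \frac{B_i}{B_T} \right|$,
where $b_i$ is the length of branch i, $A_i$ and $B_i$ are the numbers of sequences descending from branch i in each sample, and $A_T$ and $B_T$ are the total numbers of sequences in each sample.

Key warnings about the UniFrac family of distances
- The scaling of the tree is important.
- The rooting of the tree is important.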
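The unweighted UniFrac definition (branch length leading to one sample but not both, divided by total branch length) can be sketched on a toy tree. Everything below is hypothetical: the tree ((A:1,B:1):1,(C:1,D:2):1), the representation of a branch as a length plus the set of tips below it, and the function name. The denominator sums the branches that lead to at least one of the two samples, which equals the sum of all branch lengths when the tree contains only taxa found in X or Y.

```python
# Hypothetical toy tree ((A:1,B:1):1,(C:1,D:2):1).
# Each branch is stored as (branch length, set of tips below that branch).
BRANCHES = [
    (1.0, {"A"}),
    (1.0, {"B"}),
    (1.0, {"A", "B"}),
    (1.0, {"C"}),
    (2.0, {"D"}),
    (1.0, {"C", "D"}),
]

def unweighted_unifrac(sample_x, sample_y, branches=BRANCHES):
    """Fraction of branch length that leads to exactly one of the two samples."""
    unique = 0.0
    total = 0.0
    for length, tips in branches:
        in_x = bool(tips & sample_x)   # branch leads to a taxon present in X
        in_y = bool(tips & sample_y)   # branch leads to a taxon present in Y
        if in_x or in_y:
            total += length
        if in_x != in_y:
            unique += length
    return unique / total

print(unweighted_unifrac({"A", "B", "C"}, {"C", "D"}))  # 5/7: branches A, B, AB, D are unique
print(unweighted_unifrac({"A", "B"}, {"C", "D"}))       # 1.0: no shared branches
```

Rescaling a single branch changes the result, which is one way to see why the scaling of the tree matters.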

Garbage in, garbage out

A note of warning: wrong normalization => wrong distance => wrong answer.

However, given the many choices, there isn't much beyond prior knowledge, experience and intuition to guide the selection of the distance.

PRINCIPAL COORDINATES ANALYSIS - MULTIDIMENSIONAL SCALING

[Figure: a vector v drawn in three dimensions with components $v_1$, $v_2$, $v_3$.]

$\mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ v_3 \end{bmatrix}$

Every multivariate sample can be represented as a vector in some vector space.

[Figure: 3D scatter plot of the iris data; axes iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length.]

Vector Basis
- A basis is a set of linearly independent vectors that span the vector space.
- Spanning the vector space: any vector in this vector space may be represented as a linear combination of the basis vectors.
- If the basis vectors are orthogonal to each other (every pairwise dot product is zero), the basis is called orthogonal. Orthogonal nonzero vectors are always linearly independent, but a basis does not have to be orthogonal.
- If, in addition, all the vectors are of length 1, the basis is called orthonormal.

[Figure: the iris data shown in the original axes (iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length) and in the PCA axes (pca$li$Axis1, pca$li$Axis2, pca$li$Axis3).]

Basic idea for analysis of multidimensional data
- Compute distances
- Reduce dimensions
- Embed in Euclidean space

The general framework behind this process is called the duality diagram, the triplet $(X_{n \times p}, Q_{p \times p}, D_{n \times n})$:
- $X_{n \times p}$: (centered) data matrix
- $Q_{p \times p}$: column weights (weights on variables)
- $D_{n \times n}$: row weights (weights on observations)

The duality diagram defines the geometry of multivariate analysis:

$$\begin{array}{ccc} \mathbb{R}^{p*} & \xrightarrow{\;X\;} & \mathbb{R}^{n} \\ Q \uparrow \; \downarrow V & & D \downarrow \; \uparrow W \\ \mathbb{R}^{p} & \xleftarrow{\;X^T\;} & \mathbb{R}^{n*} \end{array}$$

with $V = X^T D X$ and $W = X Q X^T$.

- It is called a duality diagram because knowledge of the eigendecomposition of $VQ$ leads to the eigendecomposition of $WD$.
- Inertia is equal to the trace (sum of the diagonal elements) of $VQ$ or $WD$.

Principal Component Analysis (PCA)
- Let $Q = I$ and $D = \frac{1}{n} I$, and let X be centered.
- $VQ = X^T D X Q = \frac{1}{n} X^T X$.
- The inertia $\mathrm{Tr}(VQ)$ is the sum of the variances.
- PCA decomposes the variance of X into uncorrelated (orthogonal) components.
- To decompose the inertia means to find the eigensystem of the $VQ$ or, equivalently, $WD$ matrices.
- Eigenvalues give the amount of inertia explained in the corresponding dimension.
- Eigenvectors give the dimensions of variability.

Example PCA

head(USArrests)
             Murder Assault UrbanPop Rape
  Alabama      13.2     236       58 21.2
  Alaska       10.0     263       48 44.5
  Arizona       8.1     294       80 31.0
  Arkansas      8.8     190       50 19.5
  California    9.0     276       91 40.6
  Colorado      7.9     204       78 38.7

[Figure: screeplot of pc.cr (plot of inertia); variances of Comp.1 to Comp.4.]

Loadings (a blank cell is a value not printed in the source output):
            Comp.1 Comp.2 Comp.3 Comp.4
  Murder    -0.536  0.418 -0.341  0.649
  Assault   -0.583  0.188 -0.268 -0.743
  UrbanPop  -0.278 -0.873 -0.378  0.134
  Rape      -0.543 -0.167  0.818

[Figure: biplot of Comp.1 vs. Comp.2 showing the 50 US states as points and the variables Murder, Assault, UrbanPop and Rape as arrows.]
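The PCA setting of the duality diagram ($Q = I$, $D = \frac{1}{n} I$) can be reproduced numerically. The sketch below assumes NumPy is available and uses only the six USArrests rows printed above, without scaling the variables, so its eigenvectors are not expected to match the loadings table (which comes from the full data set). It illustrates that the inertia equals the sum of the variances and that $VQ$ and $WD$ share their nonzero eigenvalues.

```python
import numpy as np

# The six USArrests rows shown above (Murder, Assault, UrbanPop, Rape).
Y = np.array([
    [13.2, 236, 58, 21.2],
    [10.0, 263, 48, 44.5],
    [ 8.1, 294, 80, 31.0],
    [ 8.8, 190, 50, 19.5],
    [ 9.0, 276, 91, 40.6],
    [ 7.9, 204, 78, 38.7],
])
n, p = Y.shape
X = Y - Y.mean(axis=0)          # centered data matrix
Q = np.eye(p)                   # column weights
D = np.eye(n) / n               # row weights

VQ = X.T @ D @ X @ Q            # p x p, equals (1/n) X^T X
WD = X @ Q @ X.T @ D            # n x n

# VQ is symmetric here, so eigh applies; sort eigenvalues in decreasing order.
eigvals, eigvecs = np.linalg.eigh(VQ)
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

inertia = np.trace(VQ)          # sum of the variances (1/n convention)
explained = eigvals / inertia   # proportion of inertia per component
scores = X @ eigvecs            # coordinates of the samples on the components

print(explained)
```

The values in `explained` are what a screeplot displays, up to the choice of scaling.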

Centering
- Let Y be a non-centered data matrix with n observations (rows) and p variables (columns).
- Let $H = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T$, where $\mathbf{1}$ is the column vector of n ones.
- Then $X = HY$ is centered.

From Euclidean distances to PCA to PCoA
- Note that if D is a matrix of Euclidean distances, then $X X^T = -\frac{1}{2} H D^{(2)} H$, where $D^{(2)}$ is the matrix of element-by-element squared distances.
- PCoA is a generalization of PCA in that knowledge of X is not required: all you need to represent the points is D, the interpoint distance matrix.
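Both statements can be verified on a small example. This is a sketch assuming NumPy, with a made-up data matrix Y: it builds the centering matrix H, checks that HY has zero column means, and checks that double-centering the squared Euclidean distances recovers the Gram matrix $X X^T$.

```python
import numpy as np

# Hypothetical data: 4 observations (rows), 2 variables (columns).
Y = np.array([[1.0, 2.0], [3.0, 0.0], [4.0, 5.0], [0.0, 1.0]])
n = Y.shape[0]

H = np.eye(n) - np.ones((n, n)) / n      # centering matrix H = I - (1/n) 1 1^T
X = H @ Y                                 # column-centered data

# Matrix of squared Euclidean distances between the rows of Y.
diff = Y[:, None, :] - Y[None, :, :]
D2 = (diff ** 2).sum(axis=2)

gram_from_distances = -0.5 * H @ D2 @ H   # should equal X X^T

print(np.allclose(X.mean(axis=0), 0.0))          # True
print(np.allclose(X @ X.T, gram_from_distances)) # True
```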

Representation of (arbitrary) distances in Euclidean space
- The idea is to use singular value decomposition (SVD) on the centered interpoint distance matrix to extract Euclidean dimensions.
- SVD: $X = U S V^T$, where S is a diagonal matrix with diagonal elements $s_1, s_2, \dots, s_n$, and U and V are orthogonal matrices (their columns are orthonormal, so their determinant is ±1, and they span their corresponding spaces).

PCoA details

Algorithm starting from D, the matrix of inter-point distances:
1. Center the rows and columns of the matrix of squared (element-by-element) distances: $S = -\frac{1}{2} H D^{(2)} H$.
2. Compute the SVD by diagonalizing S: $S = U \Lambda U^T$.
3. Extract the Euclidean representation: $X = U \Lambda^{1/2}$.

The relative values of the diagonal elements of $\Lambda$ give the proportion of variability explained by each of the axes. The values of $\Lambda$ should always be examined when deciding how many dimensions to retain.
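The three PCoA steps translate almost line by line into code. The sketch below assumes NumPy; the function name `pcoa` and the choice to keep only axes with positive eigenvalues are assumptions made here (non-Euclidean distances can produce negative eigenvalues, which this sketch simply drops). The example uses Euclidean distances between four made-up points, so the recovered coordinates should reproduce the input distances.

```python
import numpy as np

def pcoa(D):
    """Classical PCoA from a square matrix of pairwise distances."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    S = -0.5 * H @ (D ** 2) @ H                  # step 1: double-center squared distances
    eigvals, U = np.linalg.eigh(S)               # step 2: S = U Lambda U^T (S is symmetric)
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]
    positive = eigvals > 1e-10                   # assumption: drop zero/negative axes
    coords = U[:, positive] * np.sqrt(eigvals[positive])    # step 3: X = U Lambda^(1/2)
    proportion = eigvals[positive] / eigvals[positive].sum()
    return coords, proportion

# Four hypothetical points forming a 3 x 4 rectangle.
points = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 4.0], [3.0, 4.0]])
D = np.sqrt(((points[:, None, :] - points[None, :, :]) ** 2).sum(axis=2))

coords, proportion = pcoa(D)
D_recovered = np.sqrt(((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=2))

print(proportion)                    # close to [0.64, 0.36]
print(np.allclose(D, D_recovered))   # True
```

The coordinates are determined only up to rotation and reflection, so they need not equal the original points even though all pairwise distances match.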

Beta-diversity; ordination analysis

ISME J. 2016 Mar 25. doi: 10.1038/ismej.2016.37

Differentiation of microbiota between diabetic and non-diabetic subjects and across body sites

[Figure: three PCoA ordinations based on weighted UniFrac distances. (A) Hands: PC1 [28.5%], PC2 [12.4%]; groups Control, Diabetic. (B) Feet: PC1 [54.0%], PC2 [7.9%]; groups Control, Diabetic. (C) Hands and feet: PC1 [37.8%], PC2 [7.4%]; groups Arm Control, Foot Control, Arm Diabetic, Foot Diabetic.]

Redel et al. J Infect Dis. 2013 207(7):1105-14

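Correspondence analysis

Is obtained by analyzing the eigenvalues of the chi-square transformed counts Y:

$$Y = \begin{bmatrix} y_{11} & y_{12} & y_{13} \\ y_{21} & y_{22} & y_{23} \\ y_{31} & y_{32} & y_{33} \end{bmatrix}, \quad y_{i+} = \begin{bmatrix} y_{11} + y_{12} + y_{13} \\ y_{21} + y_{22} + y_{23} \\ y_{31} + y_{32} + y_{33} \end{bmatrix}, \quad y_{+j} = \begin{bmatrix} y_{11} + y_{21} + y_{31}, & \dots \end{bmatrix}$$

$$Q = [q_{ij}], \quad q_{ij} = \frac{y_{ij}\, y_{++} - y_{i+}\, y_{+j}}{y_{++} \sqrt{y_{i+}\, y_{+j}}}$$

where $y_{++}$ is the grand total of Y.

- SVD analysis of Q results in principal components for correspondence analysis.
- Correspondence analysis preserves the chi-square distance.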
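The chi-square transformation and the SVD step can be sketched as follows, assuming NumPy and a made-up 3x3 count table. The transformation used is $q_{ij} = (y_{ij} y_{++} - y_{i+} y_{+j}) / (y_{++} \sqrt{y_{i+} y_{+j}})$; under that assumption the sum of the squared singular values of Q (the total inertia) equals the Pearson chi-square statistic of the table divided by the grand total, which the code checks.

```python
import numpy as np

# Hypothetical count table: rows are samples, columns are taxa.
Y = np.array([[10.0, 5.0, 2.0],
              [ 3.0, 8.0, 4.0],
              [ 1.0, 2.0, 9.0]])

total = Y.sum()                          # y_++
row = Y.sum(axis=1, keepdims=True)       # y_i+ as a column vector
col = Y.sum(axis=0, keepdims=True)       # y_+j as a row vector

# Chi-square transformed counts; row @ col is the outer product y_i+ * y_+j.
Q = (Y * total - row @ col) / (total * np.sqrt(row @ col))

# SVD of Q: squared singular values are the inertia carried by each axis.
U, s, Vt = np.linalg.svd(Q, full_matrices=False)
inertia = (s ** 2).sum()

# Pearson chi-square statistic computed directly, for comparison.
expected = row @ col / total
chi_square = ((Y - expected) ** 2 / expected).sum()

print(inertia, chi_square / total)       # the two values agree
```

The last singular value is numerically zero because Q has rank at most min(rows, columns) - 1.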

Within class analysis

Suggested reading/references
- Susan Holmes, "Multivariate Data Analysis: The French Way", IMS Lecture Notes Monograph Series, 2006.
- Any proof-based linear algebra textbook.