Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Similar documents
Lecture 5: Ecological distance metrics; Principal Coordinates Analysis. Univariate testing vs. community analysis

Multivariate Statistics

Multivariate Statistics

Last time: PCA. Statistical Data Mining and Machine Learning Hilary Term Singular Value Decomposition (SVD) Eigendecomposition and PCA

Jakarta International School 6 th Grade Formative Assessment Graphing and Statistics -Black

New Educators Campaign Weekly Report

Multivariate Statistics

Outline. Administrivia and Introduction Course Structure Syllabus Introduction to Data Mining

Standard Indicator That s the Latitude! Students will use latitude and longitude to locate places in Indiana and other parts of the world.

Challenge 1: Learning About the Physical Geography of Canada and the United States

Multivariate Analysis

Correction to Spatial and temporal distributions of U.S. winds and wind power at 80 m derived from measurements

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee May 2018

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee October 2017

Cooperative Program Allocation Budget Receipts Southern Baptist Convention Executive Committee October 2018

, District of Columbia

Hourly Precipitation Data Documentation (text and csv version) February 2016

Additional VEX Worlds 2019 Spot Allocations

Printable Activity book

Abortion Facilities Target College Students

QF (Build 1010) Widget Publishing, Inc Page: 1 Batch: 98 Test Mode VAC Publisher's Statement 03/15/16, 10:20:02 Circulation by Issue

Preview: Making a Mental Map of the Region

A. Geography Students know the location of places, geographic features, and patterns of the environment.

Chapter. Organizing and Summarizing Data. Copyright 2013, 2010 and 2007 Pearson Education, Inc.

Club Convergence and Clustering of U.S. State-Level CO 2 Emissions

2005 Mortgage Broker Regulation Matrix

Office of Special Education Projects State Contacts List - Part B and Part C

North American Geography. Lesson 2: My Country tis of Thee

Online Appendix: Can Easing Concealed Carry Deter Crime?

Parametric Test. Multiple Linear Regression Spatial Application I: State Homicide Rates Equations taken from Zar, 1984.

JAN/FEB MAR/APR MAY/JUN

RELATIONSHIPS BETWEEN THE AMERICAN BROWN BEAR POPULATION AND THE BIGFOOT PHENOMENON

Osteopathic Medical Colleges

Multivariate Classification Methods: The Prevalence of Sexually Transmitted Diseases

What Lies Beneath: A Sub- National Look at Okun s Law for the United States.

SUPPLEMENTAL NUTRITION ASSISTANCE PROGRAM QUALITY CONTROL ANNUAL REPORT FISCAL YEAR 2008

Intercity Bus Stop Analysis

Summary of Natural Hazard Statistics for 2008 in the United States

Alpine Funds 2017 Tax Guide

Alpine Funds 2016 Tax Guide

Meteorology 110. Lab 1. Geography and Map Skills

Crop Progress. Corn Mature Selected States [These 18 States planted 92% of the 2017 corn acreage]

LABORATORY REPORT. If you have any questions concerning this report, please do not hesitate to call us at (800) or (574)

LABORATORY REPORT. If you have any questions concerning this report, please do not hesitate to call us at (800) or (574)

Grand Total Baccalaureate Post-Baccalaureate Masters Doctorate Professional Post-Professional

MINERALS THROUGH GEOGRAPHY

Grand Total Baccalaureate Post-Baccalaureate Masters Doctorate Professional Post-Professional

An Analysis of Regional Income Variation in the United States:

OUT-OF-STATE 965 SUBTOTAL OUT-OF-STATE U.S. TERRITORIES FOREIGN COUNTRIES UNKNOWN GRAND TOTAL

Statistical Methods for Data Mining

High School World History Cycle 2 Week 2 Lifework

15 Singular Value Decomposition

JAN/FEB MAR/APR MAY/JUN

BlackRock Core Bond Trust (BHK) BlackRock Enhanced International Dividend Trust (BGY) 2 BlackRock Defined Opportunity Credit Trust (BHL) 3

Insurance Department Resources Report Volume 1

Office of Budget & Planning 311 Thomas Boyd Hall Baton Rouge, LA Telephone 225/ Fax 225/

MINERALS THROUGH GEOGRAPHY. General Standard. Grade level K , resources, and environmen t

extreme weather, climate & preparedness in the american mind

Mathematical foundations - linear algebra

GIS use in Public Health 1

Non-iterative, regression-based estimation of haplotype associations

Image Registration Lecture 2: Vectors and Matrices

Lecture 2: Diversity, Distances, adonis. Lecture 2: Diversity, Distances, adonis. Alpha- Diversity. Alpha diversity definition(s)

Linear Algebra (Review) Volker Tresp 2017

National Organization of Life and Health Insurance Guaranty Associations

FLOOD/FLASH FLOOD. Lightning. Tornado

Review of Linear Algebra

Pima Community College Students who Enrolled at Top 200 Ranked Universities

Large Scale Data Analysis Using Deep Learning

United States Geography Unit 1

112th U.S. Senate ACEC Scorecard

Rank University AMJ AMR ASQ JAP OBHDP OS PPSYCH SMJ SUM 1 University of Pennsylvania (T) Michigan State University

A Summary of State DOT GIS Activities

HP-35s Calculator Program Lambert 1

MO PUBL 4YR 2090 Missouri State University SUBTOTAL-MO

KS PUBL 4YR Kansas State University Pittsburg State University SUBTOTAL-KS

Introducing North America

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

Linear Algebra Review. Vectors

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Infant Mortality: Cross Section study of the United State, with Emphasis on Education

Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Distances and similarities Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Linear Algebra (Review) Volker Tresp 2018

CS 246 Review of Linear Algebra 01/17/19

This project is supported by a National Crime Victims' Right Week Community Awareness Project subgrant awarded by the National Association of VOCA

Package ZIM. R topics documented: August 29, Type Package. Title Statistical Models for Count Time Series with Excess Zeros. Version 1.

Evidence for increasingly extreme and variable drought conditions in the contiguous United States between 1895 and 2012

HP-35s Calculator Program Lambert 2

A GUIDE TO THE CARTOGRAPHIC PRODUCTS OF

Office of Budget & Planning 311 Thomas Boyd Hall Baton Rouge, LA Telephone 225/ Fax 225/

All-Time Conference Standings

My Background 1/6/2011 PLACES I HAVE BEEN TO STUDY GEOLOGY

Linear Algebra for Machine Learning. Sargur N. Srihari

Foundations of Computer Vision

Fungal conservation in the USA

14 Singular Value Decomposition

Multivariate Statistical Analysis

DOWNLOAD OR READ : USA PLANNING MAP PDF EBOOK EPUB MOBI

Transcription:

Lecture 5: Ecological distance metrics; Principal Coordinates Analysis Univariate testing vs. community analysis Univariate testing deals with hypotheses concerning individual taxa Is this taxon differentially present/abundant in different samples? Is this taxon correlated with a given continuous variable? What if we would like to draw conclusions about the community as a whole? 2

Useful ideas from modern statistics Distances between anything (abundances, presence-absence, graphs, trees) Direct hypotheses based on distances. Decompositions through iterative structuration. Projections. Randomization tests, probabilistic simulations. 3 Data à Distances à Statistics S-1 S-2 S-16 0.54 OTU-1 OTU-2 OTU-3 OTU-4 OTU-5. OTU-10,000 0.54 0.34 0.60 0.17 0.10 0.01 0.75 0.17 0.72 0.24 0.34 0.74 0.89 0.58 0.08 0.43 0.82 0.91 0.69 0.64 0.57 0.06 0.59 0.03 0.69 0.89 0.07 0.88 0.68 0.59 0.49 0.19 0.64 0.54 0.26 0.24 0.71 0.58 0.52 0.93 0.94 0.46 0.70 0.30 0.92 0.46 0.17 0.21 0.17 0.02 0.81 0.22 0.45 0.59 0.51 0.23 0.74 0.36 0.15 1.00 0.95 0.71 0.88 0.94 0.65 0.29 0.30 0.75 0.28 0.04 0.45 0.15 0.44 0.59 0.89 0.07 0.19 0.57 0.63 0.27 0.72 0.34 0.70 0.19 0.17 0.85 0.07 0.97 0.24 0.45 0.95 0.53 0.86 0.01 0.10 0.63 0.40 0.10 0.51 1.00 0.20 0.41 0.11 0.14 0.83 0.43 0.12 0.39 0.43 0.87 0.42 0.17 0.62 0.57 0.03 0.58 0.54 0.48 0.83 Visualization (Principal coordinates analysis) Statistical hypothesis testing (PERMANOVA) 4

What is a distance metric? Scalar function d(.,.) of two arguments d(x, y) >= 0, always nonnegative; d(x, x) = 0, distance to self is 0; d(x, y) = d(y, x), distance is symmetric; d(x, y) < d(x, z) + d(z, y), triangle inequality. 5 Using distances to capture multidimensional heterogeneous information A good distance will enable us to analyze any type of data usefully We can build specialized distances that incorporate different types of information (abundance, trees, geographical locations, etc.) We can visualize complex data as long as we know the distances between objects (observations, variables) We can compute distances (correlations) between distances to compare them We can decompose the sources of variability contributing to distances in ANOVA-like fashion 6

Distance and similarity Sometimes it is conceptually easier to talk about similarities rather than distances E.g. sequence similarity Any similarity measure can be converted into a distance metric, e.g. S If S is (0, 1), D=1-S If S>0, D = 1/S or D = exp(-s) 7 A few useful distances and similarity indices Distances: Euclidean: (remember Pythagoras theorem) Σ(x i -y i ) 2 Weighted Euclidean: χ 2 = Σ(e i - o i ) 2 /e i Hamming/L1, Bray Curtis = Σ 1 {xi=yi} Unifrac (later) Jensen-Shannon: (D(X M) + D(Y M))/2, where M = (X + Y)/2 Kullback-Leibler divergence: D(X Y) = Σ ln[x i /y i ]x i Similarity: Correlation coefficient Matching coefficient: (f 11 +f 00 )/(f 11 + f 10 + f 01 + f 00 ) Jaccard Similarity Index: f 11 /(f 11 + f 10 + f 01 ) x\y 1 0 1 f 11 f 10 0 f 01 f 00 8

Unifracdistance (Lozupone and Knight, 2005) Is a distance between groups of organisms related by a tree Definition: Ratio of the sum of the length of the branches leading to sample X or Y, but not both, to the sum of all branch lengths of the tree. 9 Weighted Unifrac(Lozupone et al., 2007) =1 = 10

A note of warning! Garbage in, garbage out Wrong normalization => wrong distance => wrong answer However, given the many choices there isn t much beyond prior knowledge, experience and intuition to guide in selection of the distance. 11 Distance matrix It is convenient to organize distances as a matrix, A=(a ij ) Distance matrices are: Symmetric: a ij = a ji Diagonals are 0: a ii = 0. A distance matrix is Euclidean if it is possible to generate these distances from a set of n points in Euclidean space. Distance matrix is commonly represented by just lower (or upper) diagonal entries. 12

Some uses of distances Suppose D is a distance matrix for n objects. The objects are of several kinds indicated by a factor variable F; Intra-group distances are the distances between objects of the same kind; Inter-group distances are the distances between objects of different kinds; Mean distance between an object and a group of other objects is equal to the distance between the object and the center of the group. 13 PRINCIPAL COORDINATE ANALYSIS 14

Vector A vector, v, of dimension n is an n 1 rectangular array of elements! v $ # 1 # v v = 2 # #! "# v n % vectors will be column vectors. They may also be row vectors, when transposed v T =[v 1, v 2,, v n ]. 15 A vector, v, of dimension n! # # v = # # "# v 1 v 2! v n can be thought a point in n dimensional space $ % 16

Every multivariate sample can be represented as a vector in some vector space v 3 v v 1 = v 2 v 3 v 2 v 1 17 Vector Basis A basis is a set of linearly independent (dot product is zero) vectors that span the vector space. Spanning the vector space: Any vector in this vector space may be represented as a linear combination of the basis vectors. The vectors forming a basis are orthogonal to each other. If all the vectors are of length 1, then the basis is called orthonormal. 18

Matrix An n m matrix, A, is a rectangular array of elements! a 11 a 12! a $ # 1m # a A = ( a ij ) = 21 a 22! a 2m # # " " # " "# a n1 a n2 a nm % n = # of rows m = # of columns dimensions = n m 19 Note: Let A and B be two matrices whose inverse exists. Let C = AB. Then the inverse of the matrix C exists and C -1 = B -1 A -1. Proof C[B -1 A -1 ] = [AB][B -1 A -1 ] = A[B B -1 ]A -1 = A[I]A -1 = AA -1 =I 20

Diagonalization Thereom If the matrix A is symmetric with distinct eigenvalues, λ 1,,! λ n, with corresponding eigenvectors x 1,, x!!! n Assume xi! xi =1!!!! then A = λ 1 x1 x1! + + λ n xn xn!! = [ x! 1,, x! λ 1 " 0 $!! x1' $ # # n ]# # $ # # = PDPʹ # 0 " λ #! "# n xn ' % "# % 21 Basic idea for analysis of multidimensional data Compute distances Reduce dimensions Embed in Euclidean space The general framework behind this process is called Duality diagram: (X nxp, Q pxp, D nxn ) X nxp (centered) data matrix Q pxp column weights (weights on variables) D nxn row weights (weights on observations) 22

Duality diagram defines the geometry of multivariate analysis x Q? R p! R n X? x??y???? y V D W R p X t R n duality diagram because knowledge V = X T DX W = XQX T Duality: The eigen decomposition of VQ leads to eigendecomposition of WD Inertiais equal to trace (sum of the diagonal elements) of VQ or WD. 23 Principal Component Analysis (PCA) Let Q = I and D = 1/n I and let X be centered. VQ = X T DXQ = 1/n X T X. The inertia Tr(VQ) = sum of the variances. PCA decomposes the variance of X into independent components. To decompose the inertia means to find the eigen-system of VQ or equivalently WD matrices. Eigenvalues give the amount of inertia explained in corresponding dimension. Eigenvectors give the dimensions of variability. 24

pc.cr Comp.1 Comp.2 Comp.3 Comp.4 Example PCA Murder Assault UrbanPop Rape Alabama 13.2 236 58 21.2 Alaska 10.0 263 48 44.5 Arizona 8.1 294 80 31.0 Arkansas 8.8 190 50 19.5 California 9.0 276 91 40.6 Colorado 7.9 204 78 38.7 Variances 0.0 0.5 1.0 1.5 2.0 Screeplot: plot of inertia Loadings: Comp.1 Comp.2 Comp.3 Comp.4 Murder -0.536 0.418-0.341 0.649 Assault -0.583 0.188-0.268-0.743 UrbanPop -0.278-0.873-0.378 0.134 Rape -0.543-0.167 0.818 Comp.2 0.2 0.1 0.0 0.1 0.2 0.3 5 0 5 Mississippi North Carolina South Carolina West VirginiaVermont Georgia Alabama Alaska Arkansas Kentucky Murder Louisiana Tennessee South Dakota North Dakota Montana Maryland Assault Wyoming Maine Virginia Idaho New Mexico Florida New Hampshire Michigan Iowa Indiana Nebraska Missouri Kansas Rape Delaware Oklahoma Texas Oregon Pennsylvania Minnesota Wisconsin Illinois Nevada Arizona Ohio New York Colorado Washington Connecticut New Jersey Massachusetts Utah Rhode Island California Hawaii UrbanPop 0.2 0.1 0.0 0.1 0.2 0.3 Comp.1 Biplot 25 5 0 5 Centering Let Y be not centered data matrix with n observations (rows) and p variables (columns) Let H = (I 1/n 1x1 ) Then X = HY is centered 26

From Euclidean distances to PCA to PCoA Note that if D is a Euclidean distance, then X X = 1/n H D (2) H. PCoA is a generalization of PCA in that knowledge of X is not required, all you need to represent the points is D, the inter-point distance matrix. 27 Representation of (arbitrary) distances in Euclidean space The idea is to use singular value decomposition (SVD) on the centered interpoint distance matrix to extract Euclidean dimensions SVD: X = U S V, where S is diagonal matrix with diagonal elements s 1, s 2,, s n, and U and V are unit matrices (i.e. their determinant is 1 and they span their corresponding spaces) 28

PCoA details Algorithm starting from D inter-point distances: Center the rows and columns of the matrix of square (element-by-element) distances: S = -1/2H D (2) H Compute SVD by diagonalizing S, S = U Λ U T Extract Euclidean representations: X = U Λ 1/2 The relative values of diagonal elements of Λ gives the proportion of variability explained by each of the axes. The values of Λ should always be looked at in deciding how many dimensions to retain. 29 Beta-diversity; ordination analysis ISME J. 2016 Mar 25. doi: 10.1038/ismej.2016.37 30

Differentiation of microbiota between diabetic and non-diabetic subjects and across body sites 31 (a) Hands B. Feet C. Hands and Feet! Weighted Unifrac (PC1 [28.5 %] PC2[12.4 %]) Control Diabetic Weighted Unifrac (PC1 [54.0 %] PC2[ 7.9 %]) Control Diabetic Weighted Unifrac (PC1 [37.8 %] PC2[ 7.4 %]) Arm Control Foot Control Arm Diabetic Foot Diabetic Redel et al. J Infect Dis. 2013 207(7):1105-14 Suggested reading/references 32 + any proof-based linear algebra text book.

Suggested reading Susan Holmes Multivariate Data Analysis: The French Way, IMS Lecture Notes Monograph Series, 2006. 33