Compositional data methods for microbiome studies

1 Compositional data methods for microbiome studies. M. Luz Calle, Dept. of Biosciences, UVic-UCC

2 Important role of the microbiome in human health

3 Microbiome and HIV
o How the gut microbiome affects immune reconstitution, HIV-1 replication and chronic inflammation in HIV-1 infected individuals.
o Dynamics of the microbiome and the inflammatory response after HIV infection.
o How the human microbiome can influence the AIDS vaccine response.

4 Outline
1. Why is a new algorithm for microbiome analysis needed?
2. Presentation of "selbal: Selection of Balances", a new algorithm for microbiome differential abundance testing

5 Microbiome study

6 OTU: Operational Taxonomic Unit and taxonomy assignment
Sequences that are highly similar (e.g. 97%) are clustered together into OTUs, which are used in place of microbial species.
[Figure: sequences clustered into OTU1, OTU2, OTU3, OTU4]

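7 OTU table or abundance table
Each column is an OTU (OTU1, ..., OTUk), each OTU is assigned to a taxon (Taxon1, ..., TaxonM), and each row is a sample with its total count N.

          OTU1   OTU2   OTU3   ...   OTUk   TOTAL
Sample1   X_11   X_12   X_13   ...   X_1k   N_1
Sample2   X_21   X_22   X_23   ...   X_2k   N_2
...
Samplep   X_p1   X_p2   X_p3   ...   X_pk   N_p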

8 Microbiome differential abundance testing
Multivariate analysis: are there global differences in microbial composition between sample groups? adonis (PERMANOVA).
Univariate testing: which taxa are differentially abundant between sample groups? Wilcoxon, DESeq2, edgeR, ...

9 Microbiome Compositional data
Microbiome data is compositional: a change in the abundance of one taxon induces changes in the observed abundances of the other taxa.


11 Microbiome Compositional data
Significant results:
Wilcoxon: "TAXA_1" "TAXA_2" "TAXA_3" "TAXA_4" "TAXA_5"
DESeq2: "TAXA_1" "TAXA_2" "TAXA_3" "TAXA_4" "TAXA_5"
edgeR: "TAXA_1" "TAXA_2" "TAXA_3" "TAXA_4" "TAXA_5"


12 Microbiome Compositional data
Univariate tests for compositional data: many of the significant findings are false positives.

13 Microbiome Compositional data

14 Microbiome Compositional data
Microbiome data is compositional: a change in the abundance of one taxon induces changes in the observed abundances of the other taxa.

15 Microbiome Compositional data
If the relative abundance of taxon 1 changes from $\pi_1$ to $\pi_1'$, the observed relative abundances of all other taxa change by a constant factor $F$:
$$\pi_j' = \pi_j F, \qquad F = \frac{1-\pi_1'}{1-\pi_1}, \qquad j \neq 1.$$

16 Microbiome Compositional data
Equivalently, the log-relative abundances of the other taxa change by a constant shift $S$:
$$\log(\pi_j') = \log(\pi_j) + S, \qquad S = \log\left(\frac{1-\pi_1'}{1-\pi_1}\right), \qquad j \neq 1.$$

17 Microbiome Compositional data
In the toy example: $\pi_1 = 0.2 \to \pi_1' = 0.6$, so $F = (1-\pi_1')/(1-\pi_1) = 0.4/0.8 = 1/2$.
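
The rescaling can be checked numerically. The following is a minimal sketch, assuming a hypothetical five-taxon composition in which taxon 1 rises from 0.2 to 0.6 while the absolute abundances of the other taxa stay fixed; the helper name `shift_first_taxon` is ours and does not come from the slides.

```python
import math

def shift_first_taxon(pi, new_pi1):
    """Composition observed after taxon 1 moves to new_pi1 while the
    absolute abundances of all other taxa stay the same."""
    factor = (1.0 - new_pi1) / (1.0 - pi[0])   # F = (1 - pi_1') / (1 - pi_1)
    return [new_pi1] + [p * factor for p in pi[1:]], factor

before = [0.2, 0.2, 0.2, 0.2, 0.2]             # hypothetical five-taxon composition
after, F = shift_first_taxon(before, 0.6)
S = math.log(F)                                 # constant shift on the log scale

print("after:", [round(p, 3) for p in after])
print("F =", round(F, 3), " S =", round(S, 3))
```

Every taxon other than taxon 1 appears to halve, although only taxon 1 changed in absolute terms.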

18 Microbiome Compositional data
H. pylori relative abundance before ($\pi_1$) and after ($\pi_1'$).
Shift in log-relative abundance: $S = \log(1-\pi_1') - \log(1-\pi_1) \approx 4$

19 Compositional data: log-ratio analysis
Let $X = (X_1, X_2, \ldots, X_k)$ be a composition of microbiome abundances.
CODA: analyze log-ratios between taxa: $\log(X_i/X_j) = \log(\pi_i/\pi_j)$
Toy example: only the log-ratios that involve taxon 1 differ between groups A and B:
$\log(X_1^A/X_2^A) \neq \log(X_1^B/X_2^B)$: $\log(0.2/0.2) \neq \log(0.6/0.1)$
...
$\log(X_2^A/X_3^A) = \log(X_2^B/X_3^B)$: $\log(0.2/0.2) = \log(0.1/0.1)$
...
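
The toy comparison can be run over every pair of taxa. This is a small sketch, assuming five taxa per group (the slides only show values for the first three); the variable names are illustrative.

```python
import math
from itertools import combinations

group_a = [0.2, 0.2, 0.2, 0.2, 0.2]   # composition in group A (assumed five taxa)
group_b = [0.6, 0.1, 0.1, 0.1, 0.1]   # group B: only taxon 1 truly changed

def log_ratio(x, i, j):
    """Log-ratio between parts i and j of composition x."""
    return math.log(x[i] / x[j])

# Pairs (1-based) whose log-ratio differs between the two groups.
changed = [
    (i + 1, j + 1)
    for i, j in combinations(range(len(group_a)), 2)
    if not math.isclose(log_ratio(group_a, i, j), log_ratio(group_b, i, j), abs_tol=1e-12)
]
print(changed)  # [(1, 2), (1, 3), (1, 4), (1, 5)]
```

Only the pairs that include taxon 1 are flagged, whereas a test on each proportion separately would flag all five taxa.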


21 Compositional balances: a new perspective for microbiome analysis (Javier Rivera, PhD thesis)
Let $X = (X_1, X_2, \ldots, X_k)$ be a composition of microbiome abundances.
Instead of individual abundances, we analyze relative abundances between groups of taxa: compositional balances. This extends the concept of the log-ratio between two taxa, $\log(X_i/X_j) = \log(\pi_i/\pi_j)$.
Let $X_+$ and $X_-$ be two disjoint subsets of components of $X$, indexed by $I_+$ and $I_-$ and containing $k_+$ and $k_-$ components respectively. The balance between $X_+$ and $X_-$ is defined as:
$$B = \sqrt{\frac{k_+ k_-}{k_+ + k_-}}\, \log\frac{\left(\prod_{i \in I_+} X_i\right)^{1/k_+}}{\left(\prod_{j \in I_-} X_j\right)^{1/k_-}} \;\propto\; \frac{1}{k_+}\sum_{i \in I_+} \log X_i - \frac{1}{k_-}\sum_{j \in I_-} \log X_j$$
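
As an illustration of the definition, here is a minimal sketch of a balance computed for a single sample. The function name `balance`, the index sets and the counts are hypothetical; the last lines show that the result is the same for raw counts and for proportions, since any common scaling cancels in the difference of mean logs.

```python
import math

def balance(x, pos, neg):
    """Balance between the parts indexed by `pos` (numerator) and `neg` (denominator)."""
    k_pos, k_neg = len(pos), len(neg)
    mean_log_pos = sum(math.log(x[i]) for i in pos) / k_pos
    mean_log_neg = sum(math.log(x[j]) for j in neg) / k_neg
    scale = math.sqrt(k_pos * k_neg / (k_pos + k_neg))
    return scale * (mean_log_pos - mean_log_neg)

counts = [120, 30, 45, 5, 300]                 # hypothetical counts for one sample
total = sum(counts)
proportions = [c / total for c in counts]

b_counts = balance(counts, pos=[0, 4], neg=[1, 2])
b_props = balance(proportions, pos=[0, 4], neg=[1, 2])
print(round(b_counts, 4), round(b_props, 4))   # the two values agree
```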


22 Selbal: an algorithm for selection of balances
$Y$: response variable, numeric or dichotomous
$X = (X_1, X_2, \ldots, X_k)$: composition
$Z = (Z_1, Z_2, \ldots, Z_r)$: covariates
Goal: to determine the sub-compositions $X_+$ and $X_-$ so that the balance $B$ between $X_+$ and $X_-$ is highly associated with $Y$ after adjustment for $Z$.
For a continuous variable $Y$: $Y = \beta_0 + \beta_1 B + \gamma' Z$
For a dichotomous variable $Y$: $\operatorname{logit}(P(Y = 1)) = \beta_0 + \beta_1 B + \gamma' Z$

23 Selbal: an algorithm for selection of balances
STEP 0: Zero replacement.
STEP 1: Optimal balance between two components, $B^{(1)}$. The algorithm exhaustively evaluates all possible balances between two components:
$$B = \frac{1}{\sqrt{2}}\left(\log(X_i) - \log(X_j)\right), \qquad i, j \in \{1, \ldots, k\},\ i \neq j.$$
STEP s: Optimal balance adding a new component. For $s > 1$ and given $B^{(s-1)}$, the algorithm evaluates the optimization criterion for the balance obtained by adding $\log(X_p)$ to $B^{(s-1)}$, for each variable $X_p$ that has not been included previously.
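
Steps 0 and 1 can be sketched as follows for a continuous response. This is not the selbal implementation: the zero replacement used here (adding a pseudo-count of 1) is an assumption, since the slides do not say which rule is applied, and the criterion is the R^2 of a simple regression of Y on the balance with no covariates Z. The names `pearson` and `best_pair` and the data are hypothetical.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv) if suu > 0 and svv > 0 else 0.0

def best_pair(counts, y, pseudo=1.0):
    # STEP 0 (assumed rule): add a pseudo-count so that every log is defined.
    x = [[c + pseudo for c in row] for row in counts]
    k = len(x[0])
    best = None
    # STEP 1: exhaustive search. B_ij = -B_ji, so R^2 is the same for both
    # orderings and unordered pairs are enough.
    for i, j in combinations(range(k), 2):
        b = [(math.log(r[i]) - math.log(r[j])) / math.sqrt(2) for r in x]
        r2 = pearson(b, y) ** 2
        if best is None or r2 > best[0]:
            best = (r2, i, j)
    return best

counts = [                       # hypothetical OTU table: 6 samples x 4 taxa
    [10, 50, 80, 20],
    [20, 5, 40, 25],
    [40, 90, 20, 0],
    [80, 10, 10, 30],
    [160, 70, 5, 22],
    [320, 8, 2, 0],
]
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # hypothetical continuous response

r2, i, j = best_pair(counts, y)
print(i, j, round(r2, 4))
```

In this made-up table the response tracks the log-ratio between the first and third taxa, so that pair is selected.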

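24 Selbal: an algorithm for selection of balances
With
$$B^{(s-1)} \propto M_+^{(s-1)} - M_-^{(s-1)} = \frac{1}{k_+^{(s-1)}}\sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)}}\sum_{j \in I_-^{(s-1)}} \log(X_j),$$
the algorithm computes, for each candidate $X_p$, the two balances obtained by adding $\log(X_p)$ to the numerator or to the denominator:
$$B^{(s+)} = \sqrt{\frac{\left(k_+^{(s-1)}+1\right) k_-^{(s-1)}}{k_+^{(s-1)} + k_-^{(s-1)} + 1}} \left(\frac{k_+^{(s-1)} M_+^{(s-1)} + \log(X_p)}{k_+^{(s-1)}+1} - M_-^{(s-1)}\right),$$
$$B^{(s-)} = \sqrt{\frac{k_+^{(s-1)} \left(k_-^{(s-1)}+1\right)}{k_+^{(s-1)} + k_-^{(s-1)} + 1}} \left(M_+^{(s-1)} - \frac{k_-^{(s-1)} M_-^{(s-1)} + \log(X_p)}{k_-^{(s-1)}+1}\right),$$
and selects the $B^{(s)}$ that maximizes the optimization criterion ($R^2$ or AUC).
STOP criterion: cross-validation.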
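
The forward steps can be sketched as below. This is a simplified illustration rather than the selbal code: it recomputes each candidate balance directly instead of using the update formulas, uses R^2 without covariates as the criterion, and stops at a fixed number of parts (`n_parts`, a hypothetical argument) where selbal relies on cross-validation. The data are synthetic and built so that the response is exactly a balance of parts 0 and 1 against part 2.

```python
import math
from itertools import combinations

def pearson(u, v):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv) if suu > 0 and svv > 0 else 0.0

def balance_scores(logx, pos, neg):
    """Balance of every sample for numerator parts `pos` and denominator parts `neg`."""
    k_pos, k_neg = len(pos), len(neg)
    scale = math.sqrt(k_pos * k_neg / (k_pos + k_neg))
    return [scale * (sum(r[i] for i in pos) / k_pos - sum(r[j] for j in neg) / k_neg)
            for r in logx]

def criterion(logx, y, pos, neg):
    """R^2 of the simple linear regression of y on the balance."""
    return pearson(balance_scores(logx, pos, neg), y) ** 2

def forward_balance(x, y, n_parts):
    logx = [[math.log(v) for v in row] for row in x]   # x must be strictly positive
    k = len(logx[0])
    # STEP 1: best balance between two parts.
    score, pos, neg = max((criterion(logx, y, [i], [j]), [i], [j])
                          for i, j in combinations(range(k), 2))
    # STEP s: try each unused part in the numerator and in the denominator.
    while len(pos) + len(neg) < min(n_parts, k):
        unused = [p for p in range(k) if p not in pos and p not in neg]
        candidates = []
        for p in unused:
            candidates.append((criterion(logx, y, pos + [p], neg), pos + [p], neg))
            candidates.append((criterion(logx, y, pos, neg + [p]), pos, neg + [p]))
        score, pos, neg = max(candidates)
    return score, pos, neg

# Synthetic data: 8 samples x 4 parts with mutually uncorrelated log-abundances.
design = [
    [+1, +1, +1, +1], [+1, +1, -1, -1], [+1, -1, +1, -1], [+1, -1, -1, +1],
    [-1, +1, +1, +1], [-1, +1, -1, -1], [-1, -1, +1, -1], [-1, -1, -1, +1],
]
x = [[2.0 ** s for s in row] for row in design]
y = [0.5 * math.log(r[0]) + 0.5 * math.log(r[1]) - math.log(r[2]) for r in x]

score, pos, neg = forward_balance(x, y, n_parts=3)
print(sorted(pos), neg, round(score, 6))
```

On this example the search recovers parts 0 and 1 in the numerator and part 2 in the denominator. Because the search is greedy, it is not guaranteed to find the best balance in general.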

25 Cross-validation: selbal.cv
Goals:
(1) to identify the optimal number of components to be included in the balance;
(2) to explore the robustness of the global balance identified with the whole dataset.

26 Crohn's disease
Ren et al. 2015: 662 patients with Crohn's disease and 313 controls. Abundance data at genus level (48 genera).


28 AUC = and cv-auc =

29 Comparison with other methods
Table columns: method, median number of taxa, mean cv-AUC.
Methods compared: selbal, DESeq, edgeR, ANCOM, ALDEx.

30 Conclusions
o The compositional nature of microbiome data should not be ignored.
o This applies not only to microbiome abundance but also to gene counts in microbiome functional analysis.
o Working with relative abundances among groups of taxa (compositional balances) overcomes the problem of differences in sample size.
o The algorithm performs forward selection (suboptimal). We are working to develop a new algorithm that finds the optimal balance through penalized regression (LASSO) for compositional data.

31 Acknowledgements: Javier Rivera, Marc Noguera, Roger Paredes and the MetaHIV group; Vera Pawlowsky-Glahn, Juan José Egozcue and the CODA group

32 Effects of "closing" compositional data
"Closing" compositional data (proportions or rarefaction) induces spurious correlation (Pearson 1896): two or more variables will be negatively correlated simply because the data are transformed to have a constant sum.
[Numerical example: data matrix x, cor(x), and the correlation matrix of the closed data $\pi_x$]
Closing also induces subcompositional incoherence in both correlations and distances.
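
The effect can be reproduced with a small deterministic example. This sketch does not use the slide's own numbers, which are not available; it assumes three hypothetical abundances that are exactly uncorrelated across eight samples (a full 2^3 factorial of the values 5 and 15) and compares the correlation before and after closure.

```python
import math
from itertools import product

def pearson(u, v):
    """Pearson correlation between two equally long sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

def column(data, j):
    return [row[j] for row in data]

# Eight samples: every combination of low (5) and high (15) for three taxa,
# so the raw abundances are mutually uncorrelated by construction.
raw = [list(levels) for levels in product([5.0, 15.0], repeat=3)]
closed = [[v / sum(row) for v in row] for row in raw]   # rows now sum to 1

cor_raw = pearson(column(raw, 0), column(raw, 1))
cor_closed = pearson(column(closed, 0), column(closed, 1))
print(round(cor_raw, 6), round(cor_closed, 6))
```

The raw correlation is zero while the closed one is -1/2. With three exchangeable parts that sum to a constant, the variances and covariances must satisfy 3 var + 6 cov = 0, which forces every pairwise correlation to -1/2.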

33 Statistical challenges of microbiome analysis
o Sparsity: large proportion of zeros in the OTU table
o Multivariate, with complex phylogenetic structure
o High dimensional
o Compositional data

34 Compositional data
Consider a vector of $K$ positive components or parts, $x = (x_1, x_2, \ldots, x_K)$.
Closed compositional data describe a data set in which the parts in each sample have a constant sum: $\sum_{i=1}^{K} x_i = 1$.
Compositional data describe a data set in which the parts in each sample have an arbitrary or noninformative sum.
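
A minimal sketch of the closure operation, assuming two hypothetical count vectors from the same community sequenced at different depths. The function name `closure` is ours.

```python
import math

def closure(x, total=1.0):
    """Rescale positive parts so that they sum to `total`."""
    s = sum(x)
    if s <= 0:
        raise ValueError("the parts must have a positive sum")
    return [total * v / s for v in x]

shallow = [10, 30, 60]      # hypothetical counts, 100 reads in total
deep = [100, 300, 600]      # same community, 1000 reads in total

print(closure(shallow))
print(closure(deep))
```

Both samples close to the same composition, which is the sense in which the total count carries no information about the community.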

35 Microbiome Compositional data
Microbiome data is compositional:
o Raw abundances (counts) are not informative: the total number of counts per sample varies widely and is related to the instrument (sequencing depth), not to microbiome abundance in the environment.
o Relative abundances (proportions) and rarefaction are used to obtain a closed microbiome composition; this may induce strong incoherence in correlations and distances.

