Statistical methods for the analysis of microbiome compositional data in HIV studies
1 1/ 56 Statistical methods for the analysis of microbiome compositional data in HIV studies. Javier Rivera Pinto. November 30, 2018
2 Outline: 1 Introduction; 2 Compositional data and microbiome analysis; 3 Kernel machine regression; 4 Identification of microbial signatures; 5 Conclusions
3 Thesis structure (applied and theoretical parts): Introduction; Characteristics of microbiome abundance matrices; Compositional data; Kernel machine regression; Identification of microbial signatures
4 Introduction
5 Human microbiome. Human microbiome: the collection of all the microorganisms living in association with the human body, including archaea, bacteria, and viruses. It accounts for about 1 to 3 percent of total body mass. The human microbiome is involved in a large number of essential functions: food digestion, immune system maintenance,... Alterations in the microbiota have been associated with high-impact diseases: asthma, cardiovascular disease, cancer,...
6 Microbiome and HIV. The gut houses most of the immune cells. Damage to the gut epithelium and bacterial translocation trigger inflammation processes, leading to systemic and chronic inflammation. IrsiCaixa is investigating how we can act on the microbiome to help people living with HIV recover immunity, and to strengthen the immune response of a therapeutic or preventive vaccine.
7 Data extraction. Microbiome studies are based on microbial DNA sequencing through two main approaches: amplicon sequencing and whole-metagenome shotgun DNA sequencing. Amplicon sequencing: sequences a phylogenetic marker gene after polymerase chain reaction (PCR) amplification. Shotgun metagenomic sequencing: sequences the total microbial DNA of a sample.
8 Data extraction (II)
9 Microbiome abundance matrix. The microbiome abundance table is usually expressed as a matrix of counts, denoted by X, with k columns (taxa) and n rows (samples). Each entry x_{ij} of X is the number of sequences (reads) corresponding to taxon j in sample i: X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1k} \\ x_{21} & x_{22} & \dots & x_{2k} \\ \vdots & & & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nk} \end{pmatrix}, with x_{ij} \in \mathbb{N} \cup \{0\}, i \in \{1,\dots,n\}, j \in \{1,\dots,k\}.
10 Standard microbiome statistical analysis (I). Normalization: the large variability of the total counts per sample is addressed with different types of normalization methods: working with proportions, rarefying counts or using percentile transformations. [Figure: amplification and sequencing errors accumulating from the 1st to the c-th PCR cycle]
11 Standard microbiome statistical analysis (II). Diversity analysis: α- and β-diversity are measured with indices of richness and evenness and with distances between the compositions of samples.
12 Standard microbiome statistical analysis (III). Ordination plots: graphical representations that project the multidimensional data onto two or three orthogonal axes, preserving the main trends of the data.
13 Standard microbiome statistical analysis (IV). Differential abundance testing: evaluation of differences in composition between groups of samples, globally or for particular taxa. It can be analyzed from a multivariate perspective or using univariate tests. Multivariate analysis: tests for global differences: PERMANOVA, ANOSIM, kernel machine regression or the Dirichlet-multinomial distribution. Univariate analysis: tests for differences for a particular taxon: edgeR or DESeq2.
14 Contributions to microbiome-HIV studies
15 Compositional data and microbiome analysis
16 Characteristics of X. Some characteristics of X may pose a problem for the analysis: the high variability of the total number of counts across individuals (normalization); the constraint induced by the maximum number of sequence reads of the DNA sequencer (compositional nature); the presence of a high amount of zeros (zero replacement).
17 Compositional data. A composition is a vector of k strictly positive components or parts, x = (x_1, \dots, x_k), x_i > 0, i \in \{1,\dots,k\}, with a constrained or non-informative total sum \sum_{i=1}^{k} x_i. Properties: each component is not informative by itself; the relevant information is contained in the ratios; two proportional vectors are equally informative.
18 Equivalence class. The simplex is the sample space of compositional data. Two vectors x_1 = (x_{11},\dots,x_{1k}) and x_2 = (x_{21},\dots,x_{2k}) are compositionally equivalent (denoted by =_a) if they are proportional, that is, x_1 =_a x_2 \iff \exists p > 0 : x_1 = p x_2. Each equivalence class has a representative in the unit simplex, defined as S^k = \{ x = (x_1,\dots,x_k) : x_i > 0, \sum_{i=1}^{k} x_i = 1 \}.
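As an illustration (not code from the thesis), the closure operation that sends every compositionally equivalent vector to its unit-simplex representative can be sketched in NumPy; the function name `closure` is ours:

```python
import numpy as np

def closure(x):
    """Project a vector of strictly positive parts onto the unit simplex."""
    x = np.asarray(x, dtype=float)
    return x / x.sum()

# Two proportional vectors are compositionally equivalent:
# they map to the same representative in the simplex.
a = np.array([10.0, 20.0, 70.0])
b = 3.5 * a
assert np.allclose(closure(a), closure(b))
```

Because only the ratios carry information, rescaling a composition never changes its simplex representative.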
19 Conditions for a proper analysis of compositions. Permutation invariance: a change in the order of the parts in the composition should not affect the results. Scale invariance: any function f used for the analysis of compositional data must be invariant for any element of the same compositional equivalence class. Subcompositional coherence: results obtained when a subset of components is analyzed should not contradict those obtained when analyzing the whole composition.
20 Issues when compositionality is ignored. Ignoring the compositional nature in microbiome studies may induce: Spurious correlations: the total-sum constraint characterizing compositional data forces some of the correlations between components to be negative. Subcompositional incoherences: results obtained for a subcomposition (a subset of the components) may disagree with those obtained for the whole composition. Increase of type I error: differential abundance testing is highly affected when the compositional nature of microbiome datasets is not acknowledged, presenting an increase of false-positive findings.
21 Aitchison geometry: log-ratio approach. Any meaningful (scale-invariant) function of a composition can be expressed in terms of ratios of its components (Aitchison, 1986). The simplest invariant function is the log-ratio between two components, f(x) = \log(x_i / x_j), i, j \in \{1,\dots,k\}. The generalization of the log-ratio is called a log-contrast, defined as f(x) = \sum_{i=1}^{k} a_i \log(x_i), with \sum_{i=1}^{k} a_i = 0.
22 Aitchison geometry: vector space structure. The perturbation and powering operations give the k-dimensional simplex a vector space structure. Given two compositions x, y \in S^k, the perturbation of x by y is x \oplus y = C(x_1 y_1, x_2 y_2, \dots, x_k y_k), and the power transformation or powering of x by a constant \alpha \in \mathbb{R} is \alpha \odot x = C(x_1^\alpha, x_2^\alpha, \dots, x_k^\alpha). The vector space is Euclidean since a norm and a distance (the Aitchison distance) are defined. The Aitchison distance between compositions x and y is d_a(x, y) = \| x \ominus y \|_a = \sqrt{ \frac{1}{2k} \sum_{i=1}^{k} \sum_{j=1}^{k} \left( \log \frac{x_i}{x_j} - \log \frac{y_i}{y_j} \right)^2 }.
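These operations are short enough to sketch in NumPy (an illustration, not the thesis code; the helper `closure` is ours). A useful sanity check is that the Aitchison distance above coincides with the ordinary Euclidean distance between clr-transformed vectors:

```python
import numpy as np

def closure(x):
    x = np.asarray(x, dtype=float)
    return x / x.sum()

def perturb(x, y):
    """Perturbation: componentwise product, re-closed to the simplex."""
    return closure(np.asarray(x, dtype=float) * np.asarray(y, dtype=float))

def power(alpha, x):
    """Powering: componentwise power, re-closed to the simplex."""
    return closure(np.asarray(x, dtype=float) ** alpha)

def aitchison_dist(x, y):
    """d_a(x, y) = sqrt( (1/(2k)) sum_{i,j} (log(x_i/x_j) - log(y_i/y_j))^2 )."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    k = len(x)
    lx, ly = np.log(x), np.log(y)
    dif = (lx[:, None] - lx[None, :]) - (ly[:, None] - ly[None, :])
    return np.sqrt((dif ** 2).sum() / (2 * k))
```

Note that perturbing x by its powering with −1 returns the neutral element (the uniform composition), exactly as subtraction does in Euclidean space.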
23 Aitchison geometry: coordinate representations. Several data transformations have been proposed in order to work in real space instead of in the simplex: the alr, clr and ilr transformations. Through these transformations, a common procedure for the statistical analysis consists of: formulate the compositional problem in terms of its components; apply the corresponding data transformation; run the appropriate statistical analysis; translate the results back in terms of the initial compositions.
24 Aitchison geometry: alr and clr transformations. Additive log-ratio transformation (alr): x = (x_1,\dots,x_k) \mapsto \left( \log(x_1/x_k), \dots, \log(x_{k-1}/x_k) \right); a reference component has to be selected. Centred log-ratio transformation (clr): x = (x_1,\dots,x_k) \mapsto \left( \log(x_1/g(x)), \dots, \log(x_k/g(x)) \right), where g(x) denotes the geometric mean of x; the new coordinates have a constant sum of 0, which implies a singular covariance matrix.
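A minimal NumPy sketch of both transformations (illustrative; the thesis itself uses the R {compositions} package, and the function names here are ours):

```python
import numpy as np

def alr(x, ref=-1):
    """Additive log-ratio: log of each part over a chosen reference part.

    Returns k-1 coordinates; `ref` selects the reference component
    (by default the last one)."""
    x = np.asarray(x, dtype=float)
    return np.log(np.delete(x, ref) / x[ref])

def clr(x):
    """Centred log-ratio: log of each part over the geometric mean,
    i.e. the centred log-abundances. Coordinates sum to zero."""
    x = np.asarray(x, dtype=float)
    logx = np.log(x)
    return logx - logx.mean()
```

The zero-sum property of the clr coordinates is exactly the source of the singular covariance matrix mentioned above.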
25 Aitchison geometry: ilr transformations. Isometric log-ratio transformation (ilr): representation of a composition in a particular orthonormal basis of S^k. It overcomes the singular-covariance problem of the clr transformation. There are multiple ways of defining an orthonormal basis (sequential binary partition). Each new coordinate is known as a balance. Given two sets of disjoint indices I_+ and I_- with k_+ and k_- components respectively, the associated balance is B = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \log \frac{ \left( \prod_{i \in I_+} x_i \right)^{1/k_+} }{ \left( \prod_{j \in I_-} x_j \right)^{1/k_-} }.
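A balance between two groups of parts can be computed directly from the formula above; the sketch below (ours, not thesis code) works with log-abundances so the geometric means become arithmetic means:

```python
import numpy as np

def balance(x, i_plus, i_minus):
    """Balance between two disjoint groups of parts:
    sqrt(k+ k- / (k+ + k-)) * log( gmean(x[I+]) / gmean(x[I-]) )."""
    x = np.asarray(x, dtype=float)
    kp, km = len(i_plus), len(i_minus)
    log_gp = np.log(x[i_plus]).mean()   # log geometric mean of the + group
    log_gm = np.log(x[i_minus]).mean()  # log geometric mean of the - group
    return np.sqrt(kp * km / (kp + km)) * (log_gp - log_gm)
```

Being a log-contrast, a balance is scale invariant: multiplying the composition by any constant leaves it unchanged.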
26 Zeros in microbiome data analysis. Depending on the particular study and the method used for measuring the information, we can distinguish three different types of zeros: Count zeros: related to processes that may be compared with a multinomial. Rounded zeros: values below a detection limit (continuous variables). Essential zeros: represent the total absence of the taxon in the sample's environment.
27 Dealing with count zeros. There are two extended ways of replacing zeros in compositional data analysis: Substitution by a pseudocount: replace zeros by 0.65, or add 1 count to the whole matrix. Geometric Bayesian multiplicative replacement: using prior information, all values are replaced so that there are no zeros and the ratios between non-zero components are preserved.
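To illustrate the ratio-preservation idea (this is a simplified multiplicative pseudocount replacement, not the geometric Bayesian multiplicative procedure itself): zeros get a small value delta, and the non-zero parts of the same sample are shrunk by a common factor so the row total, and hence every ratio between non-zero parts, is preserved.

```python
import numpy as np

def multiplicative_replacement(counts, delta=0.65):
    """Replace zeros by delta and rescale the non-zero parts of each row
    multiplicatively, so row totals and ratios between non-zero
    components are preserved. Simplified illustration only."""
    out = np.asarray(counts, dtype=float).copy()
    for row in out:
        zero = row == 0
        if zero.any():
            total = row.sum()
            row[zero] = delta
            row[~zero] *= (total - delta * zero.sum()) / total
    return out
```

Because all non-zero entries in a row are multiplied by the same factor, any ratio x_i/x_j between them is unchanged, which is the property that matters for log-ratio analysis.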
28 Kernel machine regression
29 Overall question. Question: is there any relationship between the microbiome and a response variable of interest? Model-based multivariate methods assume a Dirichlet-multinomial distribution for the abundance matrix: {HMP} package (LaRosa, 2012); Dirichlet-multinomial regression (Chen, 2013). Distance-based multivariate methods are based on a distance matrix D: analysis of similarity (ANOSIM); permutational analysis of variance (PERMANOVA); kernel machine regression.
30 Kernel machine regression: formulation. Kernel machine regression is a semi-parametric regression model that includes a non-parametric component to associate a set of covariates X, for instance microbiome abundances, with a response variable of interest Y: Y_i = \beta_0 + \beta' Z_i + h(X_i) + \epsilon_i for continuous outcomes, and \mathrm{logit}(Y_i) = \beta_0 + \beta' Z_i + h(X_i) for dichotomous outcomes. The non-parametric part measures the relationship between the microbiome composition and the outcome.
31 Kernel machine regression: association test. The association is evaluated with the hypotheses H_0: h(X) = 0 vs. H_1: h(X) \neq 0. Kernel machine regression is a special kind of mixed model where h(X) is a subject-specific random effect, h(X) \sim N(0, \tau K), where K is the kernel matrix defined from the distance matrix D as K = -\frac{1}{2} \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right) D^2 \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right). Thus, the association test can be rewritten as H_0: \tau = 0 vs. H_1: \tau \neq 0.
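The distance-to-kernel step is just double (Gower) centering of the elementwise-squared distance matrix. A NumPy sketch follows; note this shows only the centering, while the actual D2K() function in the R {MiRKAT} package may additionally correct the result to be positive semi-definite:

```python
import numpy as np

def dist_to_kernel(D):
    """Gower-centre a distance matrix into a kernel:
    K = -1/2 (I - 1 1^T / n) D^2 (I - 1 1^T / n),
    with D^2 taken elementwise."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n   # centering matrix
    return -0.5 * J @ (D ** 2) @ J
```

For a Euclidean distance matrix this recovers the centred Gram (inner-product) matrix of the underlying points, which is why K can play the role of a covariance in the mixed-model formulation.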
32 Kernel machine regression: compositional data (I). Analogously to standard kernel machine regression, it is defined as Y_i = \beta_0 + \beta' Z_i + h(\mathrm{clr}(X_i)) + \epsilon_i for continuous outcomes and \mathrm{logit}(Y_i) = \beta_0 + \beta' Z_i + h(\mathrm{clr}(X_i)) for dichotomous outcomes, where the association test is H_0: h(\mathrm{clr}(X)) = 0 vs. H_1: h(\mathrm{clr}(X)) \neq 0, and the kernel matrix is defined using the Aitchison distance matrix D_A as K = -\frac{1}{2} \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right) D_A^2 \left( I - \frac{\mathbf{1}\mathbf{1}^T}{n} \right).
33 Kernel machine regression: compositional data (II). The procedure, named MiRKAT-CoDA, is defined by the following steps: 1 Zero replacement. 2 Compute the Aitchison distance matrix: once the count matrix does not contain zeros, the clr() function of the {compositions} package is used to compute the centred log-ratio transformed values; then the Euclidean distance is calculated, yielding the Aitchison distance matrix D_A. 3 Obtain the kernel matrix: the D2K() function in the {MiRKAT} package transforms the distance matrix D_A into the kernel matrix K_A. 4 Implement kernel machine regression: the MiRKAT() function, in the package of the same name, performs the kernel machine regression given the response variable, the kernel matrix and the covariate adjustment.
34 Kernel machine regression: weighted version (I). Weighted Aitchison distance: given two compositions x_1 = (x_{11},\dots,x_{1k}) and x_2 = (x_{21},\dots,x_{2k}) and a vector of weights w = (w_1,\dots,w_k), d_w(x_1, x_2) = \sqrt{ \sum_{i=1}^{k} w_i \left( \log \frac{y_{1i}}{g_w(y_1)} - \log \frac{y_{2i}}{g_w(y_2)} \right)^2 }, where y_1 and y_2 are the initial compositions x_1 and x_2 divided componentwise by w, y_i = x_i / w = (x_{i1}/w_1, \dots, x_{ik}/w_k), and g_w(\cdot) denotes the weighted geometric mean, g_w(y) = \exp\left( \frac{1}{s_w} \sum_{i=1}^{k} w_i \log(y_i) \right), with s_w = \sum_{i=1}^{k} w_i the total sum of the weights.
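The weighted distance translates directly into code. The NumPy sketch below is ours, written from the definitions above; with all weights equal to 1 it reduces to the Euclidean distance between clr vectors, i.e. the unweighted Aitchison distance:

```python
import numpy as np

def weighted_gmean_log(y, w):
    """Logarithm of the weighted geometric mean g_w(y)."""
    return (w * np.log(y)).sum() / w.sum()

def weighted_aitchison_dist(x1, x2, w):
    """d_w(x1, x2) = sqrt( sum_i w_i (log(y1_i/g_w(y1)) - log(y2_i/g_w(y2)))^2 ),
    where y_j = x_j / w componentwise."""
    x1, x2, w = (np.asarray(a, dtype=float) for a in (x1, x2, w))
    y1, y2 = x1 / w, x2 / w
    d1 = np.log(y1) - weighted_gmean_log(y1, w)
    d2 = np.log(y2) - weighted_gmean_log(y2, w)
    return np.sqrt((w * (d1 - d2) ** 2).sum())
```

Shrinking w_i towards 0 downweights taxon i's contribution to the distance, which is exactly the lever used in the weighted MiRKAT-CoDA procedure.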
35 Kernel machine regression: weighted version (II). The procedure to measure the contribution of each part to the association is called weighted MiRKAT-CoDA and is given by: 1 Zero replacement. 2 MiRKAT-CoDA with weights: considering a sequence of weights S = \{s_1,\dots,s_q\}, for each component X_i, i \in \{1,\dots,k\}, and each particular value s_r \in S, the components of w = (w_1,\dots,w_k) are defined as w_j = 1 for j \neq i and w_i = s_r. For each pair of weight and variable, a p-value is obtained after running the MiRKAT() function, yielding a table like the following one:
36 Kernel machine regression: weighted version (III).
           Taxon 1   Taxon 2   ...   Taxon k
w = s_1    p_11      p_12      ...   p_1k
w = s_2    p_21      p_22      ...   p_2k
...
w = s_q    p_q1      p_q2      ...   p_qk
3 Linear regression: the contribution of each taxon is summarized by the slope of the linear regression between the weights S and minus the logarithm of the p-values obtained for each weight. The larger the slope, the larger the contribution of the variable to the global association. 4 Slope ranking: once the contribution of each taxon has been estimated, the taxa can be ranked in decreasing order so that the most important features appear at the top of the list.
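Steps 3 and 4 can be sketched as a small NumPy routine (ours, for illustration): given the q x k table of p-values, fit a least-squares line of -log10(p) against the weight sequence for each taxon, then sort taxa by slope.

```python
import numpy as np

def rank_taxa_by_slope(weights, pvals, taxa):
    """For each taxon (column of pvals), fit a least-squares line of
    -log10(p-value) against the weight sequence and rank taxa by slope
    in decreasing order."""
    s = np.asarray(weights, dtype=float)
    y = -np.log10(np.asarray(pvals, dtype=float))  # shape (len(s), n_taxa)
    s_c = s - s.mean()
    # Per-column least-squares slope: sum((s - s_bar)(y - y_bar)) / sum((s - s_bar)^2)
    slopes = (s_c @ (y - y.mean(axis=0))) / (s_c ** 2).sum()
    order = np.argsort(-slopes)
    return [(taxa[i], slopes[i]) for i in order]
```

A taxon whose p-value drops sharply as its weight grows gets a large slope and lands at the top of the ranking; a taxon whose p-value is insensitive to its weight gets a slope near zero.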
37 Application to the microbiome-HIV association (I). Data information: 156 subjects (127 HIV-positive and 29 HIV-negative individuals); microbial information for 60 different genera; the MSM variable (men who have sex with men) as a possible confounding variable. MiRKAT-CoDA: running this algorithm on the HIV study yields a p-value of … if we do not adjust by MSM, and a p-value of … after adjusting for sexual practice.
38 Application to the microbiome-HIV association (II). Weighted MiRKAT-CoDA. [Figure: taxa contribution, −log10(p-value) against weights with slope size per genus; taxa shown include g_RC9_gut_group, g_Bacteroides, f_Erysipelotrichaceae_g_unclassified, f_Ruminococcaceae_g_Incertae_Sedis, g_Succinivibrio, g_Oribacterium, f_vadinBB60_g_unclassified, g_Phascolarctobacterium, g_Roseburia, g_Anaerostipes]
39 Discussion. Contributions: MiRKAT-CoDA, kernel machine regression using the Aitchison distance, avoids the possible incoherences that result from distance measures that are not subcompositionally dominant. The weighted version of MiRKAT-CoDA allows ranking the different taxa according to their importance in the global association with the outcome. Limitations: the default set of weights S may be uninformative. If the reference global p-value (when no weights are considered) is very small or zero, the weighting method is not very informative, since changing the weight of just one taxon can hardly modify the global p-value, which remains equal to zero in most cases.
40 Identification of microbial signatures
41 Specific question. Question: which specific taxa are associated with the outcome? When the response variable is dichotomous, the question is known as differential abundance analysis and can be addressed in different ways: Based on RNA-seq analysis: {edgeR} package (Robinson, 2010); {DESeq2} package (Anders, 2014). Based on compositional data analysis: {ALDEx2} package (Fernandes, 2013); {ANCOM} package (Mandal, 2015).
42 Microbial signature. Microbial signature: groups of microbial taxa that are predictive of a phenotype of interest. Microbial signatures are useful for diagnosis, prognosis or prediction of therapeutic responses.
43 selbal. Model: selbal is a model selection procedure that searches for a sparse model that adequately explains the response variable of interest. Goal: given a numeric or dichotomous response variable Y, a composition X = (X_1,\dots,X_k) and additional covariates Z = (Z_1,\dots,Z_r), it determines two disjoint subcompositions of X, X_+ and X_-, whose balance B(X_+, X_-) is highly associated with Y after adjustment for the covariates Z: B(X_+, X_-) = \sqrt{\frac{k_+ k_-}{k_+ + k_-}} \log \frac{ \left( \prod_{i \in I_+} X_i \right)^{1/k_+} }{ \left( \prod_{j \in I_-} X_j \right)^{1/k_-} } \propto \frac{1}{k_+} \sum_{i \in I_+} \log X_i - \frac{1}{k_-} \sum_{j \in I_-} \log X_j.
44 selbal: algorithm (I). selbal looks for the best balance following these steps: 1 Zero replacement. 2 Optimal balance between two components: exhaustive evaluation of all the possible balances composed of only two components, that is, all balances of the form B_{ij} = B(X_i, X_j) = \frac{1}{\sqrt{2}} \left( \log(X_i) - \log(X_j) \right) for i, j \in \{1,\dots,k\}, i \neq j. Depending on the class of the response variable, each balance B_{ij} is tested for association with Y with Y = \beta_0 + \beta_1 B_{ij} + \gamma' Z for continuous responses or \mathrm{logit}(Y) = \beta_0 + \beta_1 B_{ij} + \gamma' Z for dichotomous responses. The balance that maximizes the optimization criterion is selected and denoted by B^{(1)}.
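The exhaustive pairwise step can be sketched for the continuous-response, no-covariates case (an illustration of the idea, not the selbal R implementation; selbal scores continuous responses by the MSE of the fit, which is what the sketch minimizes):

```python
import numpy as np
from itertools import combinations

def best_pairwise_balance(logX, y):
    """Evaluate every two-part balance B_ij = (log X_i - log X_j)/sqrt(2)
    in a simple linear model y ~ B_ij and return the pair (i, j) whose
    fit has the smallest mean squared error."""
    n, k = logX.shape
    best = None
    for i, j in combinations(range(k), 2):
        b = (logX[:, i] - logX[:, j]) / np.sqrt(2)
        A = np.column_stack([np.ones(n), b])          # intercept + balance
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit
        mse = ((y - A @ coef) ** 2).mean()
        if best is None or mse < best[0]:
            best = (mse, (i, j))
    return best[1]
```

This is an O(k^2) scan; it is the only exhaustive step of the algorithm, since later steps grow the balance greedily.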
45 selbal: algorithm (II). 3 Optimal balance adding a new component: for s > 1 and until the stop criterion is fulfilled, let B^{(s-1)} be the balance defined in the previous step (s-1), given by B^{(s-1)} \propto \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j), where I_+^{(s-1)} and I_-^{(s-1)} are two disjoint subsets of indices in \{1,\dots,k\} with k_+^{(s-1)} and k_-^{(s-1)} elements, respectively. For each of the remaining variables X_p not yet included in the balance, p \notin I_+^{(s-1)} \cup I_-^{(s-1)}, the algorithm considers the balance obtained by adding \log(X_p) to the positive part of B^{(s-1)} or to its negative part:
46 selbal: algorithm (III). B_p^{(s+)} \propto \frac{1}{k_+^{(s-1)} + 1} \left( \sum_{i \in I_+^{(s-1)}} \log(X_i) + \log(X_p) \right) - \frac{1}{k_-^{(s-1)}} \sum_{j \in I_-^{(s-1)}} \log(X_j), \quad B_p^{(s-)} \propto \frac{1}{k_+^{(s-1)}} \sum_{i \in I_+^{(s-1)}} \log(X_i) - \frac{1}{k_-^{(s-1)} + 1} \left( \sum_{j \in I_-^{(s-1)}} \log(X_j) + \log(X_p) \right). Each of these pairs of balances B_p^{(s+)} and B_p^{(s-)}, for each of the remaining variables X_p, is tested for association with the response variable through the corresponding regression model. Finally, the balance that maximizes the optimization criterion defines the new balance B^{(s)} for the s-th step.
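One forward step of this greedy growth can be sketched as follows (continuous response, no covariates, scored by MSE; ours, not the selbal R code; the sqrt(k+ k-/(k+ + k-)) normalisation is dropped because the regression coefficient absorbs it):

```python
import numpy as np

def extended_balance(logX, i_plus, i_minus):
    """Unnormalised balance: mean of logs in I+ minus mean of logs in I-."""
    return logX[:, i_plus].mean(axis=1) - logX[:, i_minus].mean(axis=1)

def forward_step(logX, y, i_plus, i_minus):
    """Try adding each remaining variable to the + side or the - side of
    the current balance and keep the extension whose simple linear fit
    y ~ balance has the smallest mean squared error."""
    n, k = logX.shape
    used = set(i_plus) | set(i_minus)
    best = None
    for p in range(k):
        if p in used:
            continue
        for plus, minus in ((i_plus + [p], i_minus), (i_plus, i_minus + [p])):
            b = extended_balance(logX, plus, minus)
            A = np.column_stack([np.ones(n), b])
            coef, *_ = np.linalg.lstsq(A, y, rcond=None)
            mse = ((y - A @ coef) ** 2).mean()
            if best is None or mse < best[0]:
                best = (mse, plus, minus)
    return best[1], best[2]
```

Iterating this step until the stop criterion (next slide) is met yields the greedy search: at each stage only 2(k − k_+ − k_-) candidate balances are scored, which is why the procedure is fast but not guaranteed to find the globally optimal balance.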
47 selbal: association measure and stop criterion. Association measure: for continuous responses, it is the mean squared error (MSE) of the linear regression model; for dichotomous outcomes, it is the area under the ROC curve. Stop criterion: there are two possible stopping rules: the algorithm stops when the improvement of the optimization criterion is lower than a specified threshold (default equal to 0), or when the specified maximum number of components has been included in the balance (default equal to 20).
48 selbal: cross-validation. A cross-validation procedure is implemented for: defining the optimal number of variables in the balance; measuring the robustness of the result. [Figure: data split into a training dataset, divided into 5 folds, and a testing dataset]
49 Application to the microbiome-Crohn's disease association (I). Data information: 975 subjects (662 patients with Crohn's disease and 313 without any symptom); microbial information for 49 different taxa. [Figure: optimal number of variables, accuracy (AUC) against number of variables]
50 Application to the microbiome-Crohn's disease association (II). Global balance. Numerator: g_Roseburia, o_Clostridiales_g, g_Bacteroides, f_Peptostreptococcaceae_g. Denominator: g_Streptococcus, g_Dialister, g_Adlercreutzia, g_Dorea, g_Oscillospira, o_Lactobacillales_g, g_Aggregatibacter, g_Eggerthella. [Figure: ROC curve (TPR vs. FPR) with its AUC, and boxplot of the balance by Crohn's disease status]
51 Application to the microbiome-Crohn's disease association (III). Cross-validation: selection frequency (%) of each taxon across the cross-validation balances: g_Dialister 100; g_Roseburia 100; o_Clostridiales_g 98; g_Bacteroides 98; g_Dorea 96; o_Lactobacillales_g 94; g_Eggerthella 92; g_Aggregatibacter 92; g_Adlercreutzia 90; f_Peptostreptococcaceae_g 86; g_Streptococcus 76; g_Oscillospira 72; g_Actinomyces 26; g_Blautia 24.
52 selbal against other methods. Results on the Crohn's disease dataset with selbal and other methods, according to the variable selection and model building procedure. [Figure: method comparison, AUC for DESeq, edgeR, ANCOM, ALDEx2 and selbal]
53 Discussion. Contributions: selbal is an alternative to the differential abundance tests available in the literature. The resulting balance has a biological meaning and avoids the type I error problems usually associated with classical differential abundance tests. selbal offers better results than the most widely used differential abundance tests. Limitations: the algorithm does not cover all the possible balances defined from a set of k taxa, so the result can be suboptimal.
54 Conclusions
55 What we have learned about microbiome and HIV infection. The chronic inflammation derived from HIV infection is responsible for an increased risk of non-AIDS-related diseases and premature aging. Previous results indicating a clear shift from Bacteroides to Prevotella in HIV-1 infection should be revised, accounting for possible confounders such as HIV risk factors, exercise or diet. Patients who spontaneously maintain sustained control of HIV, elite controllers (EC), have a microbiota different from that of individuals with progressive infection and more similar to that of HIV-negative individuals. Though diet is known to have an important effect on gut microbiome composition in healthy individuals, measuring its effects in HIV infection is difficult because of the lack of extensive and reliable information at this level.
56 New approaches for the analysis of microbiome compositional data. Microbiome abundance data are compositional: the constraint on the total number of reads induces strong dependencies among the abundances of different taxa. The use of standard statistical methods ignoring the compositionality can have important adverse implications. Kernel machine regression combined with the Aitchison distance provides a powerful framework for testing global associations between the microbiome and a response variable of interest. Kernel machine regression combined with the weighted Aitchison distance provides a measure of the contribution of each taxon to the joint microbiome association with the outcome. The search for microbial signatures with selbal is a powerful approach for defining biomarkers to differentiate groups of samples or to identify associations.
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 2: PROBABILITY DISTRIBUTIONS Parametric Distributions Basic building blocks: Need to determine given Representation: or? Recall Curve Fitting Binary Variables
More informationBacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria
Bacterial Communities in Women with Bacterial Vaginosis: High Resolution Phylogenetic Analyses Reveal Relationships of Microbiota to Clinical Criteria Seminar presentation Pierre Barbera Supervised by:
More informationProbabilistic classification CE-717: Machine Learning Sharif University of Technology. M. Soleymani Fall 2016
Probabilistic classification CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2016 Topics Probabilistic approach Bayes decision theory Generative models Gaussian Bayes classifier
More informationSample Size Estimation for Studies of High-Dimensional Data
Sample Size Estimation for Studies of High-Dimensional Data James J. Chen, Ph.D. National Center for Toxicological Research Food and Drug Administration June 3, 2009 China Medical University Taichung,
More informationTwo-sample tests of high-dimensional means for compositional data
Biometrika (208, 05,,pp. 5 32 doi: 0.093/biomet/asx060 Printed in Great Britain Advance Access publication 3 November 207 Two-sample tests of high-dimensional means for compositional data BY YUANPEI CAO
More information5. Discriminant analysis
5. Discriminant analysis We continue from Bayes s rule presented in Section 3 on p. 85 (5.1) where c i is a class, x isap-dimensional vector (data case) and we use class conditional probability (density
More informationExperimental Design and Data Analysis for Biologists
Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1
More informationLinear Regression and Its Applications
Linear Regression and Its Applications Predrag Radivojac October 13, 2014 Given a data set D = {(x i, y i )} n the objective is to learn the relationship between features and the target. We usually start
More informationStatistical Pattern Recognition
Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction
More informationOutline Classes of diversity measures. Species Divergence and the Measurement of Microbial Diversity. How do we describe and compare diversity?
Species Divergence and the Measurement of Microbial Diversity Cathy Lozupone University of Colorado, Boulder. Washington University, St Louis. Outline Classes of diversity measures α vs β diversity Quantitative
More informationMultivariate Statistical Analysis
Multivariate Statistical Analysis Fall 2011 C. L. Williams, Ph.D. Lecture 4 for Applied Multivariate Analysis Outline 1 Eigen values and eigen vectors Characteristic equation Some properties of eigendecompositions
More informationStatistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies
University of Pennsylvania ScholarlyCommons Publicly Accessible Penn Dissertations --206 Statistical Methods for High Dimensional Count and Compositional Data With Applications to Microbiome Studies Yuanpei
More informationA Program for Data Transformations and Kernel Density Estimation
A Program for Data Transformations and Kernel Density Estimation John G. Manchuk and Clayton V. Deutsch Modeling applications in geostatistics often involve multiple variables that are not multivariate
More informationMark your answers ON THE EXAM ITSELF. If you are not sure of your answer you may wish to provide a brief explanation.
CS 189 Spring 2015 Introduction to Machine Learning Midterm You have 80 minutes for the exam. The exam is closed book, closed notes except your one-page crib sheet. No calculators or electronic items.
More informationL11: Pattern recognition principles
L11: Pattern recognition principles Bayesian decision theory Statistical classifiers Dimensionality reduction Clustering This lecture is partly based on [Huang, Acero and Hon, 2001, ch. 4] Introduction
More informationCS 231A Section 1: Linear Algebra & Probability Review. Kevin Tang
CS 231A Section 1: Linear Algebra & Probability Review Kevin Tang Kevin Tang Section 1-1 9/30/2011 Topics Support Vector Machines Boosting Viola Jones face detector Linear Algebra Review Notation Operations
More information[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements
[Part 2] Model Development for the Prediction of Survival Times using Longitudinal Measurements Aasthaa Bansal PhD Pharmaceutical Outcomes Research & Policy Program University of Washington 69 Biomarkers
More informationMachine Learning - MT Clustering
Machine Learning - MT 2016 15. Clustering Varun Kanade University of Oxford November 28, 2016 Announcements No new practical this week All practicals must be signed off in sessions this week Firm Deadline:
More informationComputer Vision Group Prof. Daniel Cremers. 10a. Markov Chain Monte Carlo
Group Prof. Daniel Cremers 10a. Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative is Markov Chain
More informationLinear Methods for Regression. Lijun Zhang
Linear Methods for Regression Lijun Zhang zlj@nju.edu.cn http://cs.nju.edu.cn/zlj Outline Introduction Linear Regression Models and Least Squares Subset Selection Shrinkage Methods Methods Using Derived
More informationPackage milineage. October 20, 2017
Type Package Package milineage October 20, 2017 Title Association Tests for Microbial Lineages on a Taxonomic Tree Version 2.0 Date 2017-10-18 Author Zheng-Zheng Tang Maintainer Zheng-Zheng Tang
More informationResampling Methods CAPT David Ruth, USN
Resampling Methods CAPT David Ruth, USN Mathematics Department, United States Naval Academy Science of Test Workshop 05 April 2017 Outline Overview of resampling methods Bootstrapping Cross-validation
More informationClassifier performance evaluation
Classifier performance evaluation Václav Hlaváč Czech Technical University in Prague Czech Institute of Informatics, Robotics and Cybernetics 166 36 Prague 6, Jugoslávských partyzánu 1580/3, Czech Republic
More informationClass 4: Classification. Quaid Morris February 11 th, 2011 ML4Bio
Class 4: Classification Quaid Morris February 11 th, 211 ML4Bio Overview Basic concepts in classification: overfitting, cross-validation, evaluation. Linear Discriminant Analysis and Quadratic Discriminant
More informationA Least Squares Formulation for Canonical Correlation Analysis
A Least Squares Formulation for Canonical Correlation Analysis Liang Sun, Shuiwang Ji, and Jieping Ye Department of Computer Science and Engineering Arizona State University Motivation Canonical Correlation
More informationBAYESIAN DECISION THEORY
Last updated: September 17, 2012 BAYESIAN DECISION THEORY Problems 2 The following problems from the textbook are relevant: 2.1 2.9, 2.11, 2.17 For this week, please at least solve Problem 2.3. We will
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More informationLecture 4 Discriminant Analysis, k-nearest Neighbors
Lecture 4 Discriminant Analysis, k-nearest Neighbors Fredrik Lindsten Division of Systems and Control Department of Information Technology Uppsala University. Email: fredrik.lindsten@it.uu.se fredrik.lindsten@it.uu.se
More informationGenetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig
Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.
More informationBayesian Methods for Machine Learning
Bayesian Methods for Machine Learning CS 584: Big Data Analytics Material adapted from Radford Neal s tutorial (http://ftp.cs.utoronto.ca/pub/radford/bayes-tut.pdf), Zoubin Ghahramni (http://hunch.net/~coms-4771/zoubin_ghahramani_bayesian_learning.pdf),
More informationTABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1
TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1 1.1 The Probability Model...1 1.2 Finite Discrete Models with Equally Likely Outcomes...5 1.2.1 Tree Diagrams...6 1.2.2 The Multiplication Principle...8
More informationESL Chap3. Some extensions of lasso
ESL Chap3 Some extensions of lasso 1 Outline Consistency of lasso for model selection Adaptive lasso Elastic net Group lasso 2 Consistency of lasso for model selection A number of authors have studied
More informationIntroductory compositional data (CoDa)analysis for soil
Introductory compositional data (CoDa)analysis for soil 1 scientists Léon E. Parent, department of Soils and Agrifood Engineering Université Laval, Québec 2 Definition (Aitchison, 1986) Compositional data
More informationGibbs Sampling in Linear Models #2
Gibbs Sampling in Linear Models #2 Econ 690 Purdue University Outline 1 Linear Regression Model with a Changepoint Example with Temperature Data 2 The Seemingly Unrelated Regressions Model 3 Gibbs sampling
More informationLinear Classifiers as Pattern Detectors
Intelligent Systems: Reasoning and Recognition James L. Crowley ENSIMAG 2 / MoSIG M1 Second Semester 2014/2015 Lesson 16 8 April 2015 Contents Linear Classifiers as Pattern Detectors Notation...2 Linear
More informationMachine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.
Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted
More informationSparse Approximation and Variable Selection
Sparse Approximation and Variable Selection Lorenzo Rosasco 9.520 Class 07 February 26, 2007 About this class Goal To introduce the problem of variable selection, discuss its connection to sparse approximation
More informationPubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH
PubH 7470: STATISTICS FOR TRANSLATIONAL & CLINICAL RESEARCH The First Step: SAMPLE SIZE DETERMINATION THE ULTIMATE GOAL The most important, ultimate step of any of clinical research is to do draw inferences;
More informationSparse Proteomics Analysis (SPA)
Sparse Proteomics Analysis (SPA) Toward a Mathematical Theory for Feature Selection from Forward Models Martin Genzel Technische Universität Berlin Winter School on Compressed Sensing December 5, 2015
More informationRobustness of Principal Components
PCA for Clustering An objective of principal components analysis is to identify linear combinations of the original variables that are useful in accounting for the variation in those original variables.
More informationLECTURE NOTE #3 PROF. ALAN YUILLE
LECTURE NOTE #3 PROF. ALAN YUILLE 1. Three Topics (1) Precision and Recall Curves. Receiver Operating Characteristic Curves (ROC). What to do if we do not fix the loss function? (2) The Curse of Dimensionality.
More information18 Bivariate normal distribution I
8 Bivariate normal distribution I 8 Example Imagine firing arrows at a target Hopefully they will fall close to the target centre As we fire more arrows we find a high density near the centre and fewer
More informationUnconstrained Ordination
Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)
More informationEvaluation. Andrea Passerini Machine Learning. Evaluation
Andrea Passerini passerini@disi.unitn.it Machine Learning Basic concepts requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain
More informationApplied Machine Learning Annalisa Marsico
Applied Machine Learning Annalisa Marsico OWL RNA Bionformatics group Max Planck Institute for Molecular Genetics Free University of Berlin 22 April, SoSe 2015 Goals Feature Selection rather than Feature
More informationStatistical Machine Learning
Statistical Machine Learning Christoph Lampert Spring Semester 2015/2016 // Lecture 12 1 / 36 Unsupervised Learning Dimensionality Reduction 2 / 36 Dimensionality Reduction Given: data X = {x 1,..., x
More informationLearning Multiple Tasks with a Sparse Matrix-Normal Penalty
Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1
More informationStat 5101 Lecture Notes
Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random
More informationUnsupervised machine learning
Chapter 9 Unsupervised machine learning Unsupervised machine learning (a.k.a. cluster analysis) is a set of methods to assign objects into clusters under a predefined distance measure when class labels
More informationStatistical tests for differential expression in count data (1)
Statistical tests for differential expression in count data (1) NBIC Advanced RNA-seq course 25-26 August 2011 Academic Medical Center, Amsterdam The analysis of a microarray experiment Pre-process image
More informationPrincipal Components Analysis. Sargur Srihari University at Buffalo
Principal Components Analysis Sargur Srihari University at Buffalo 1 Topics Projection Pursuit Methods Principal Components Examples of using PCA Graphical use of PCA Multidimensional Scaling Srihari 2
More informationBIG IDEAS. Area of Learning: SCIENCE Life Sciences Grade 11. Learning Standards. Curricular Competencies
Area of Learning: SCIENCE Life Sciences Grade 11 BIG IDEAS Life is a result of interactions at the molecular and cellular levels. Evolution occurs at the population level. Learning Standards Organisms
More informationMulti-state Models: An Overview
Multi-state Models: An Overview Andrew Titman Lancaster University 14 April 2016 Overview Introduction to multi-state modelling Examples of applications Continuously observed processes Intermittently observed
More informationMODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES
MODEL-FREE LINKAGE AND ASSOCIATION MAPPING OF COMPLEX TRAITS USING QUANTITATIVE ENDOPHENOTYPES Saurabh Ghosh Human Genetics Unit Indian Statistical Institute, Kolkata Most common diseases are caused by
More informationEvaluation requires to define performance measures to be optimized
Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation
More informationFocus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.
Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,
More informationMicrobiota: Its Evolution and Essence. Hsin-Jung Joyce Wu "Microbiota and man: the story about us
Microbiota: Its Evolution and Essence Overview q Define microbiota q Learn the tool q Ecological and evolutionary forces in shaping gut microbiota q Gut microbiota versus free-living microbe communities
More information