Proposed methods for analyzing microbial community dynamics. Zaid Abdo March 23, 2011



2 Goals; Data; Model; Preliminary Results

3 There are two main goals for the data analysis as I see it:
1. Provide a viable, preferably mechanistic, understanding of the dynamics of vaginal microbial communities and their association with metadata (cross-sectional and over time).
2. Provide a predictive tool that allows newly sampled individuals to be classified based on their microbial community composition and their associated metadata, for example in terms of their BV state.

4 Goal 1 is a prerequisite for goal 2. Classification (prediction of BV state) requires understanding the characteristics of the population of interest (women) in light of their vaginal community composition and the drivers of change in these communities (metadata). There are multiple ways to skin this cat! I will present a Bayesian, model-based framework.

5 A required pre-step to data analysis is to live with the data (Leo Breiman). One readily observable characteristic of our data is that it has multiple layers of complexity.

6 Layers of the data (diagram):
Metadata (environment and host): data about communities
  Static level: age, race, education, tubal ligation, etc.
  Dynamic level (time dependent): menses, vaginal intercourse, pH, BV state, etc.
Community composition data: data about OTUs within communities

7 Layers of the data, with a latent layer added (diagram):
Metadata
  Static level: age, race, education, tubal ligation, etc.
  Dynamic level (time dependent): menses, vaginal intercourse, pH, BV state, etc.
Latent data: unobserved community characteristics, for inference and model simplification
Community composition data

8 The approach I present is known as hierarchical (or multilevel) modeling in the statistics literature. It divides the problem into multiple levels, each of which can have its own response and explanatory variables. It links the different levels through the parameters of their associated models.

9 In our case, assume that the response variables are the Nugent score $NS_{jt}$ and the microbial community composition $CS_{jt} = \{Y_{ijt}\}$ (a vector of counts or proportions of OTUs $i$), both observed per woman $j$ and over time $t$. Also assume that the explanatory variables are $pH_{jt}$, measured per woman over time, and the woman's ethnicity $E_j$.

10 For any woman, the previously described data structure translates to:
Time:                  0, 1, ..., t-1, t
pH:                    $pH_{j0}$, ..., $pH_{jt}$
Ethnicity:             $E_j$
Nugent score:          $NS_{j0}$, ..., $NS_{jt}$
Community composition: $\{Y_{ij0}\}$, ..., $\{Y_{ijt}\}$

11 We can write a model, probabilistically, to reflect the above as follows:
[Equation not preserved; its annotations were: Parameters, Hyperparameters, Model 1, Model 2.]
It is composed of two sub-models, one of which forms a prior for the other.

12 Goal 1: A model
Let's take model 1 and see what we can do with it. For example, one can think of the count $Y_{ijt}$ as Poisson distributed with a certain parameter that is modeled using a state space model. (Diagram inputs: $E_j$, $pH_{jt}$.)
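A minimal sketch of what such a Poisson state space model could look like, assuming the simplest possible choices: the log of the Poisson rate is a Gaussian random walk plus linear effects of ethnicity and pH. The function names, the random-walk form, and every numeric value below are illustrative assumptions, not the model fitted in this talk.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method (fine for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_counts(ph_series, ethnicity_effect, ph_effect=-0.5, state_sd=0.2, seed=1):
    """Simulate Y_t ~ Poisson(exp(state_t + ethnicity_effect + ph_effect * pH_t)).

    The latent state follows a Gaussian random walk; all coefficients are hypothetical.
    """
    rng = random.Random(seed)
    state = 3.0  # hypothetical initial log-abundance
    rates, counts = [], []
    for ph in ph_series:
        state += rng.gauss(0.0, state_sd)         # state equation: random walk
        rate = math.exp(state + ethnicity_effect + ph_effect * ph)
        rates.append(rate)
        counts.append(sample_poisson(rate, rng))  # observation equation
    return rates, counts


ph_series = [4.0, 4.2, 4.5, 5.0, 4.4]  # made-up pH readings for one woman
rates, counts = simulate_counts(ph_series, ethnicity_effect=0.3)
```

The state equation carries information from one time point to the next, while the observation equation turns the current rate into a count; that separation is what makes the model a state space model.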

13 Goal 1: A model
The above is OK, but:
1. The data at hand are sparse; i.e., they contain a lot of zeros for a lot of OTUs. So at any time point not all of the OTUs exist.
2. There are two ways an OTU count can be zero: a) the OTU does not exist, or b) it exists but we did not sample it. So we have to account for our detection ability.
We account for these two limitations by modeling them using a resource-occupancy lookalike model.
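The two routes to an observed zero can be written as a mixture. In the hedged sketch below, $\psi$ (the probability that the OTU is present) and $q$ (the probability that a present OTU goes undetected) are notation assumed here for illustration.

```latex
P(Y_{ijt} = 0)
  = \underbrace{(1 - \psi)}_{\text{OTU absent}}
  + \underbrace{\psi \, q}_{\text{OTU present but not sampled}}
```

A plain Poisson model has no way to separate these two terms, which is why detection has to be modeled explicitly.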

14 For woman j and OTU i (diagram over time 0, 1, ..., t-1, t):
Community composition state: $z_{ij0}, \ldots, z_{ijt}$; = 1 if OTU i exists, = 0 if not
Transitions between states: probability of initial existence, probability of persistence, probability of continuing absence
$N_{ij0}, \ldots, N_{ijt}$: population size of OTU i at time t in woman j
$\lambda_{ij0}, \ldots, \lambda_{ijt}$: mean population size
$p_{ij0}, \ldots, p_{ijt}$: detection probability
Observed data: $Y_{1ij0}, Y_{2ij0}, Y_{3ij0}, \ldots, Y_{1ijt}, Y_{2ijt}, Y_{3ijt}$

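15 For woman j and OTU i (diagram over time 0, 1, ..., t-1, t):
Community composition state: $z_{ij0}, \ldots, z_{ijt}$
$\lambda_{ij0} = N_{ij0} \times p_{ij0}$, ..., $\lambda_{ijt} = N_{ijt} \times p_{ijt}$
Observed data: $Y_{ij0}, \ldots, Y_{ijt}$
$z_{ijt}$ is latent data: = 1 if OTU i exists, = 0 if not.
Diagram label: parameters to be modeled.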

16 What about the Nugent score? We can model it in a similar manner to that described above, though this is not the part I am going to talk about! Given the microbial community structure and all other variables, can we predict the Nugent score for a woman? Using Bayes' rule we can write this as follows:
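The equation on this slide did not survive extraction. As a hedged stand-in, Bayes' rule applied to the quantities defined earlier would take the generic form below, where $X_{jt} = (pH_{jt}, E_j)$ is shorthand introduced here for the explanatory variables; the exact form shown in the talk may differ.

```latex
P(NS_{jt} \mid CS_{jt}, X_{jt})
  = \frac{P(CS_{jt} \mid NS_{jt}, X_{jt}) \, P(NS_{jt} \mid X_{jt})}
         {P(CS_{jt} \mid X_{jt})}
```

The numerator pairs a model for community composition given the Nugent score with a model for the Nugent score itself, which is the two-sub-model structure described earlier.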

17 The above can be rewritten as: [equation not preserved]
See the resemblance to our model?

18 Too many parameters to work with? Not as many as you might think. The hierarchy correlates these parameters and links them through the hyperparameters, so we have an effective number of parameters rather than counting every parameter as separate. This is an interesting concept that reduces the risk of overparameterization.
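To make the idea of an effective number of parameters concrete, here is a toy sketch using a normal-normal hierarchy with known variances, which is an assumption chosen for simplicity and not the model in this talk. Each group parameter is shrunk toward a shared prior mean, and one common measure of the effective number of parameters (the trace of the map from data to fitted values) comes out smaller than the raw count.

```python
def shrinkage_estimates(group_means, prior_mean, within_var, between_var):
    """Posterior means under y_j ~ N(theta_j, within_var), theta_j ~ N(prior_mean, between_var).

    prior_mean and between_var play the role of hyperparameters linking the theta_j.
    """
    pooling = within_var / (within_var + between_var)  # weight on the shared prior mean
    estimates = [pooling * prior_mean + (1.0 - pooling) * y for y in group_means]
    # Each fitted value moves by (1 - pooling) per unit change in its own observation,
    # so the trace of the data-to-fit map is J * (1 - pooling).
    effective_params = len(group_means) * (1.0 - pooling)
    return estimates, effective_params


# Four hypothetical group means, equal within- and between-group variance.
estimates, effective_params = shrinkage_estimates(
    [2.0, 4.0, 6.0, 8.0], prior_mean=5.0, within_var=1.0, between_var=1.0
)
```

With these made-up inputs the four nominal parameters behave like two: the stronger the link through the hyperparameters (smaller between-group variance), the fewer effective parameters remain.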

19 The following is based on a simplified model with only Nugent score and community composition for women 403, 407, 411, 415, 424, 434 and 436.
Figure 1: The posterior estimate of the chance that L. iners (left) and L. gasseri (right) will persist in the next time period, given that it was part of the community in the previous time period, for each woman.

20 Figure 2: The smoothed (given all data) posterior expected abundance of L. iners (left) and L. gasseri (right) for woman 403.

22 Figure 3: The posterior chance of having a Nugent score of 0 (left) and 8 (right) per woman, regardless of time.

23 I think that this is a powerful approach worth pursuing.

25 Goal 1: A model
For this model we go through the following steps for woman j.
At time 0: OTU $i$ either exists, with probability $\psi_{ij0}$, or it does not. We introduce the missing data $z_{ij0}$ such that $z_{ij0} = 1$ if OTU $i$ exists and $z_{ij0} = 0$ otherwise. If OTU $i$ exists, then it has abundance $N_{ij0}$ and will be detected with probability $p_{ij0}$. $N_{ij0}$ is assumed to follow a Poisson distribution. The observed abundance $Y_{ij0}$ then follows a binomial distribution.
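The time-0 steps above can be written compactly as follows. This is a hedged reconstruction: taking $\lambda_{ij0}$ as the Poisson mean follows the "mean population size" label in the earlier diagram and is an assumption, since the slide's own equations were not preserved and slide 15 writes $\lambda$ as $N \times p$.

```latex
\begin{aligned}
z_{ij0} &\sim \mathrm{Bernoulli}(\psi_{ij0}) \\
N_{ij0} \mid z_{ij0} = 1 &\sim \mathrm{Poisson}(\lambda_{ij0}) \\
Y_{ij0} \mid N_{ij0} &\sim \mathrm{Binomial}(N_{ij0},\, p_{ij0}) \\
\Rightarrow\quad Y_{ij0} \mid z_{ij0} = 1 &\sim \mathrm{Poisson}(\lambda_{ij0}\, p_{ij0})
\end{aligned}
```

The last line is the standard Poisson thinning result: a binomial sample from a Poisson population is again Poisson, with the mean scaled by the detection probability.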

26 Goal 1: A model
At time t:
i) If OTU $i$ exists at time $t-1$, then $z_{ij(t-1)} = 1$, and it will persist with probability $\phi_{ijt}$ (i.e., $z_{ijt} = 1$). It will then have abundance $N_{ijt}$ and will be detected with probability $p_{ijt}$. Accordingly, [equation not preserved]
ii) If OTU $i$ does not exist at time $t-1$, then $z_{ij(t-1)} = 0$, and it might colonize with probability $\gamma_{ijt}$ (i.e., $z_{ijt} = 1$). It will then have abundance $N_{ijt}$ and will be detected with probability $p_{ijt}$.
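The existence, persistence, colonization and detection steps can be sketched as a small forward simulation for one woman and one OTU. This is an illustration under simplifying assumptions: the parameters are held constant over time (the slides index them by t), the Poisson mean is called `lam`, and all function names and values are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson variate with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def sample_binomial(n, p, rng):
    """Draw a Binomial(n, p) variate as a sum of n Bernoulli trials."""
    return sum(1 for _ in range(n) if rng.random() < p)


def simulate_otu(n_times, psi, phi, gamma, lam, p, seed=0):
    """Simulate latent existence z, abundance N and observed count Y over time."""
    rng = random.Random(seed)
    z_path, n_path, y_path = [], [], []
    z = 1 if rng.random() < psi else 0              # initial existence
    for t in range(n_times):
        if t > 0:
            prob_present = phi if z == 1 else gamma  # persistence or colonization
            z = 1 if rng.random() < prob_present else 0
        n = sample_poisson(lam, rng) if z == 1 else 0  # abundance only if present
        y = sample_binomial(n, p, rng)                 # imperfect detection
        z_path.append(z)
        n_path.append(n)
        y_path.append(y)
    return z_path, n_path, y_path


z_path, n_path, y_path = simulate_otu(10, psi=0.8, phi=0.9, gamma=0.2, lam=30.0, p=0.4, seed=3)
```

In real data only `y_path` is observed; `z_path` and `n_path` are the latent quantities the Bayesian model has to infer.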

27 Goal 1: A model
Parameters that will be modeled are: [list not preserved]
$\lambda$ will be modeled using a log-linear model, while the others will be modeled using logistic regression. Remember the model structure from before, with $E_j$ and $pH_{jt}$ as inputs!
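A minimal sketch of the two link functions mentioned above, assuming a linear predictor in ethnicity and pH. Coding ethnicity as a 0/1 indicator and the coefficient names are assumptions made for illustration only.

```python
import math


def log_linear_rate(intercept, ethnicity_coef, ph_coef, ethnicity, ph):
    """Log-linear model: log(lambda) is linear in the covariates, so lambda > 0."""
    return math.exp(intercept + ethnicity_coef * ethnicity + ph_coef * ph)


def logistic_probability(intercept, ethnicity_coef, ph_coef, ethnicity, ph):
    """Logistic regression: logit(probability) is linear in the covariates."""
    linear = intercept + ethnicity_coef * ethnicity + ph_coef * ph
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical coefficients for one woman (ethnicity indicator 1, pH 4.5).
rate = log_linear_rate(2.0, 0.3, -0.2, ethnicity=1, ph=4.5)
prob = logistic_probability(-1.0, 0.5, 0.4, ethnicity=1, ph=4.5)
```

The log link keeps the abundance mean positive and the logit link keeps each probability between 0 and 1, which is why the two kinds of parameter get different regression forms.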

Hierarchical models
Represent processes and observations that span multiple levels (aka multi-level models).
[Diagram: $R_1, R_2, R_3$ above $N_1, \ldots, N_9$] $N_i$ = true abundance on a