
1 Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models
Xin Chen (1), TingTing He (2), Xiaohua Hu (1), Yuan An (1), Xindong Wu (3)
(1) College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA
(2) Dept. of Computer Science, Central China Normal University, Wuhan, China
(3) Department of Computer Science, University of Vermont, Burlington, VT, USA

2 Backgrounds: Genomics
Genomics refers to the analysis of genomes. A genome can be thought of as the complete set of DNA sequences that codes for the hereditary material passed on from generation to generation. These DNA sequences include all of the genes (the functional and physical units of heredity passed from parent to offspring) and transcripts (the RNA copies that are the initial step in decoding the genetic information) within the genome. Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.

3 Backgrounds: GenBank and NCBI
In recent years, GenBank and NCBI have grown rapidly with the advancement of gene sequencing technology.

4 Backgrounds: annotating algorithms
With the growth of GenBank and NCBI, many annotating algorithms have been developed to match genomic sequences to the GenBank/NCBI standard references and attach meta-information to the sequences.

5 Backgrounds: meta-information
The annotated meta-information involves hierarchical data such as the NCBI Taxonomy and the Gene Ontology.

6 Challenges: Metagenomics
With fast-advancing sequencing techniques, large numbers of sequenced genomes and metagenomes from uncultured microbial samples have become available. The goal of metagenomics is to study the genome-wide gene-expression data from uncultured environmental samples (such as the ocean, soil and the human body) and to understand the underlying biological processes.

7 Research Questions
What are the major research questions of our study? We use our data mining framework to investigate the following questions:
1) Given a large number of genome fragments from a microbial sample, what genomes are there? Answering this question requires mapping the metagenomic reads to taxonomic units (usually by homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).
2) What are the major functions of these genomes? Answering this question involves annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) at the genome level (a.k.a. functional analysis).
Our research objective: we aim to develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and to identify their functional roles.

8 Related topics in this presentation:
- Structural annotation and protein encoding regions
- Homology-based functional analysis
- Topic Models

9 Structural annotation and protein encoding regions
Structural annotation: annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences.

10 Structural annotation and protein encoding regions (continued)
NCBI standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available in each NCBI online query.

11 Related topics in this presentation:
- Structural annotation and protein encoding regions
- Homology-based functional analysis
- Topic Models

12 Functional analysis - overview
Functional analysis uncovers the major gene functions related to the genomic sequences. It requires explaining the biochemical activity (a.k.a. molecular function) of the gene product and identifying the biological process to which the gene or gene product contributes (including information about the enzymes, pathways and metabolic capabilities related to the gene).

13 Homology-based functional analysis (Richter and Huson, 2009)
A homology-based approach has recently been introduced to achieve functional annotation of metagenomic reads (Richter and Huson, 2009). The framework begins with the homology-based BLASTX algorithm, which matches the metagenomic fragments against the reference sequences in the NCBI database. The BLASTX hits associate fragments with related protein IDs and gene names. The Gene Ontology (GO) database is then used to map the associated gene names to the corresponding GO terms, which provides an overview of gene functions and products for the metagenomic fragments.

14 Homology-based functional analysis (Richter and Huson, 2009)
GO terms obtained from database identifier mapping (Richter and Huson, 2009)

15 Limitations of Homology-based Functional Analysis Methods
1. Homology-based approaches rely heavily on the result of local sequence alignment (such as BLAST and BLASTX) against the known open reading frames (ORFs). A BLAST-like local alignment may return either hundreds of hits or no hits at all, depending on the E-value threshold used. In the latter case, the current methods are unable to provide any functional annotation. In the former case, there is usually no proper tie-breaker to further reduce the hits, which makes the functional annotation somewhat ambiguous (with hundreds of probable explanations).
2. The homology-based functional annotation methods do not provide any insight into the major functional capabilities of genomes (such as which gene functions are more commonly shared by strains of the same species), because there is no priority among the annotated GO terms.
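The threshold sensitivity in point 1 can be illustrated with a small sketch. The hit records, their tuple layout, and the function name `filter_hits` are hypothetical and chosen only for illustration; real BLAST output would first have to be parsed into this shape.

```python
from collections import defaultdict

def filter_hits(hits, max_evalue):
    """Group alignment hits by fragment, keeping those with E-value <= max_evalue.

    `hits` is a list of (fragment_id, subject_id, evalue) tuples. This layout
    is an assumption made for illustration, not a BLAST output format.
    """
    kept = defaultdict(list)
    for fragment_id, subject_id, evalue in hits:
        if evalue <= max_evalue:
            kept[fragment_id].append(subject_id)
    return dict(kept)

# Toy records, invented purely for illustration.
toy_hits = [
    ("frag1", "protA", 1e-30),
    ("frag1", "protB", 1e-8),
    ("frag1", "protC", 1e-4),
    ("frag2", "protD", 1e-3),
]

strict = filter_hits(toy_hits, 1e-10)  # frag2 receives no annotation at all
loose = filter_hits(toy_hits, 1e-2)    # frag1 keeps several competing hits
```

With the strict threshold one fragment is left unannotated, while the loose threshold leaves the other fragment with several candidates and no rule for choosing among them.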

16 Related topics in this presentation:
- Structural annotation and protein encoding regions
- Homology-based functional analysis
- Topic Models

17 Topic Modeling - Intuitive
Example passage: Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a stepwise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image.
Keywords highlighted in the passage: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel.
- Assume the data we see is generated by some parameterized random process.
- Learn the parameters that best explain the data.
- Use the model to predict (infer) new data, based on data seen so far.

18 Notations
- Word: basic unit; an item from a vocabulary indexed by $\{1, \dots, V\}$.
- Document: a sequence of $N$ words, denoted by $\mathbf{w} = (w_1, w_2, \dots, w_N)$.
- Collection: a total of $D$ documents, denoted by $C = \{\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_D\}$.
- Topic: denoted by $z$; the total number of topics is $K$. Each topic has its own word distribution $p(w \mid z)$.


19 Background & Existing Techniques of Generative Latent Topic Models
The Naïve Bayesian model makes the word-topic decision
\[ z^{*} = \arg\max_{z} p(z \mid w), \qquad p(z \mid w) \propto p(z)\, p(w \mid z) \]
where $p(w \mid z)$ is the likelihood of word $w$ given topic $z$ and $p(z)$ is the prior probability of topic $z$.
The probabilistic latent semantic indexing (PLSI) model (Hofmann, 2001)
Assumption: each document has a mixture of $k$ topics.
Fitting the model involves estimating the topic-specific word distributions $p(w_i \mid z_k)$ and the document-specific topic distributions $p(z_k \mid d_j)$ from the corpus via maximum likelihood estimation (MLE).
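The Naïve Bayesian word-topic decision can be sketched in a few lines. The function name `most_likely_topic`, the dictionary layout, and the probabilities are illustrative assumptions, not part of the original slides.

```python
def most_likely_topic(word, prior, likelihood):
    """Return argmax_z p(z) * p(word | z).

    prior:      dict mapping topic -> p(z)
    likelihood: dict mapping topic -> dict mapping word -> p(w | z)
    """
    scores = {z: prior[z] * likelihood[z].get(word, 0.0) for z in prior}
    return max(scores, key=scores.get)

# Toy numbers, chosen only for illustration.
prior = {"z1": 0.7, "z2": 0.3}
likelihood = {
    "z1": {"w1": 0.1, "w2": 0.9},
    "z2": {"w1": 0.8, "w2": 0.2},
}

topic_for_w1 = most_likely_topic("w1", prior, likelihood)  # compares 0.07 with 0.24
topic_for_w2 = most_likely_topic("w2", prior, likelihood)  # compares 0.63 with 0.06
```

The normalizing constant $p(w)$ is the same for every topic, so it can be dropped when taking the argmax.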

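20 Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
Model assumptions:
\[ \phi^{(j)} \sim \mathrm{Dir}(\beta), \qquad \theta^{(d)} \sim \mathrm{Dir}(\alpha) \]
\[ p(z \mid d) \sim \mathrm{Multi}(\theta^{(d)}), \qquad p(w \mid z = j) \sim \mathrm{Multi}(\phi^{(j)}) \]
Topic assignment probability for word $w_i$ (counts $n$ exclude the current position $i$; $W$ is the vocabulary size and $T$ the number of topics):
\[ p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \cdot \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha} \]
In the PLSI model, the topic mixture probabilities $p(z_k \mid d_j)$ for documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated, so it is not scalable. The LDA model treats the probability of latent topics for each document, $p(z \mid d)$, and the conditional probability of words for each latent topic, $p(w \mid z)$, as latent random variables that are subject to change when a new document arrives.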
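The generative assumptions of LDA can be made concrete with a short simulation. This is a minimal sketch using symmetric Dirichlet priors; the function names, the fixed document length, and the parameter values are assumptions made for illustration.

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(draws)
    return [d / total for d in draws]

def sample_categorical(probs, rng):
    """Draw an index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_corpus(num_docs, doc_len, num_topics, vocab_size, alpha, beta, seed=0):
    rng = random.Random(seed)
    # phi[j]: word distribution of topic j, phi^(j) ~ Dir(beta)
    phi = [sample_dirichlet([beta] * vocab_size, rng) for _ in range(num_topics)]
    corpus = []
    for _ in range(num_docs):
        # theta: topic mixture of this document, theta^(d) ~ Dir(alpha)
        theta = sample_dirichlet([alpha] * num_topics, rng)
        doc = []
        for _ in range(doc_len):
            z = sample_categorical(theta, rng)   # topic for this position
            w = sample_categorical(phi[z], rng)  # word given that topic
            doc.append(w)
        corpus.append(doc)
    return corpus

corpus = generate_corpus(num_docs=3, doc_len=5, num_topics=2,
                         vocab_size=4, alpha=0.5, beta=0.5)
```

Model estimation runs this process in reverse: given only the words, it infers the hidden topic assignments and the two families of distributions.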

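21 LDA Model Estimation - Gibbs Sampling Monte Carlo process (Griffiths, 2004)
Probability of a topic being assigned to a word given the other observations:
\[ p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \; p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \]
Word term:
\[ p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(w_i \mid z_{w_i} = j, \phi^{(j)}, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \; p(\phi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \, d\phi^{(j)} = \frac{n^{(w_i)}_{-i,j} + \beta}{n^{(\cdot)}_{-i,j} + W\beta} \]
in which $p(w_i \mid z_{w_i} = j, \phi^{(j)}, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \phi^{(j)}_{w_i}$ and $p(\phi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi^{(j)}) \, p(\phi^{(j)})$. Since $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \phi^{(j)}) \sim \mathrm{Multi}(\phi^{(j)})$ and $p(\phi^{(j)}) \sim \mathrm{Dir}(\beta)$, it follows that $p(\phi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\beta + n^{(w_i)}_{-i,j})$.
Document term:
\[ p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(z_{w_i} = j \mid \theta^{(d)}) \; p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \, d\theta^{(d)} = \frac{n^{(d_i)}_{-i,j} + \alpha}{n^{(d_i)}_{-i,\cdot} + T\alpha} \]
in which $p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta^{(d)}) \, p(\theta^{(d)})$. Since $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta^{(d)}) \sim \mathrm{Multi}(\theta^{(d)})$ and $p(\theta^{(d)}) \sim \mathrm{Dir}(\alpha)$, we have $p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\alpha + n^{(d)}_{-i,j})$.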
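The product of the word term and the document term can be computed directly from count tables. The sketch below assumes the counts already exclude the current word position; the function name `topic_conditional`, the list-based count layout, and the toy numbers are illustrative assumptions.

```python
def topic_conditional(word, doc, n_wt, n_t, n_dt, n_d, alpha, beta):
    """Normalized p(z_i = j | everything else) for j = 0..T-1.

    All counts must already exclude the current word position i.
    n_wt[w][j]: times word w is assigned to topic j
    n_t[j]:     total words assigned to topic j
    n_dt[d][j]: words in document d assigned to topic j
    n_d[d]:     total words in document d
    """
    W = len(n_wt)  # vocabulary size
    T = len(n_t)   # number of topics
    weights = []
    for j in range(T):
        word_term = (n_wt[word][j] + beta) / (n_t[j] + W * beta)
        doc_term = (n_dt[doc][j] + alpha) / (n_d[doc] + T * alpha)
        weights.append(word_term * doc_term)
    total = sum(weights)
    return [x / total for x in weights]

# Toy counts: 2 words, 2 topics, 1 document.
n_wt = [[3, 0], [1, 2]]
n_t = [4, 2]
n_dt = [[4, 2]]
n_d = [6]
probs = topic_conditional(word=0, doc=0, n_wt=n_wt, n_t=n_t,
                          n_dt=n_dt, n_d=n_d, alpha=1.0, beta=1.0)
```

For these toy counts the unnormalized weights are 5/12 and 3/32, so topic 0 receives probability 40/49.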


22 Monte Carlo process
Given the word-topic posterior probability, the Monte Carlo process becomes straightforward: it is similar to throwing a die (given the probability of each face appearing) to determine the topic assigned to each word for the next round.
Given the probability for each word: $p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}), \; j = 1, \dots, K$
Result: a new topic assignment for each word.
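The dice-throwing step amounts to drawing one index from a discrete distribution. A minimal sketch follows; the function name `draw_topic` is an assumption, and the uniform number is passed in explicitly so that the behaviour is easy to inspect.

```python
import random

def draw_topic(probs, u):
    """Return the topic index whose cumulative-probability interval contains u.

    probs: p(z = j | ...) for j = 0..K-1, summing to 1
    u:     a number in [0, 1), normally drawn uniformly at random
    """
    cumulative = 0.0
    for j, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return j
    return len(probs) - 1  # guard against floating-point round-off

probs = [0.2, 0.5, 0.3]
rng = random.Random(0)
new_topic = draw_topic(probs, rng.random())
```

In a full Gibbs sweep this draw is repeated for every word position, with the count tables updated after each new assignment.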

23 Statistical relationships of words and topics

24 An example of topic assignment to words

25 Experiments

26 Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models
In our experiment, based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in metagenome samples can be inferred by probabilistic topic modeling. Probabilistic topic modeling is a Bayesian method that is able to extract useful topical information from unlabeled data. When it is used to study microbial samples, the functional elements (including taxonomic levels, indicators of gene orthologous groups and KEGG pathway mappings) bear an analogy with words. Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.
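The analogy between functional elements and words can be sketched as a bag-of-words conversion, with each sample playing the role of a document. The sample names, the element lists, and both function names below are invented for illustration.

```python
from collections import Counter

def build_vocabulary(samples):
    """Map each distinct functional element to an integer index."""
    elements = sorted({e for elems in samples.values() for e in elems})
    return {e: i for i, e in enumerate(elements)}

def to_count_vector(elements, vocabulary):
    """Bag-of-words counts for one sample, ordered by vocabulary index."""
    counts = Counter(elements)
    ordered = sorted(vocabulary, key=vocabulary.get)
    return [counts.get(e, 0) for e in ordered]

# Toy samples; the element lists are hypothetical.
samples = {
    "sampleA": ["COG0463", "COG0642", "COG0463", "COG1132"],
    "sampleB": ["COG0642", "COG0438"],
}
vocabulary = build_vocabulary(samples)
vectors = {name: to_count_vector(elems, vocabulary)
           for name, elems in samples.items()}
```

Each count vector can then be fed to a topic model exactly as a document's word counts would be.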

27 Experimental Data Collection
In our experiment, we apply probabilistic topic modeling to identify functional groups from human gut microbial community data. The data were generated by [Qin, et al. 2010] and are openly accessible. The human gut microbial samples from [Qin, et al. 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from two different groups, one with Crohn's disease (CD) and the other with ulcerative colitis (UC). In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.

28 Experimental Data Collection (continued)
According to [Qin, et al. 2010], the Illumina GA reads from human gut microbial samples are first assembled into longer contigs. After that, the Glimmer program is used to predict protein-encoding sequences (CDs) from the assembled contigs. The predicted CDs sequences are then aligned to each other to form a non-redundant CDs catalog (a.k.a. minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.
Example catalog record:
CDs_id: MH0001
Name: GL _ MH0001 _[Lack_ 3'-end] ]_[mrna]_ locus=scaffold96_ 9:1:1206:-
Length: 1206
COG/KO: COG4799 K01966
Pathway mapping: map00280, map00640
Taxonomic level: species - Eubacterium eligens

29 Experimental Data Collection (continued)
In our experiment, three types of functional elements are derived from the non-redundant CDs catalog: the NCBI taxonomic level indicators, the indicators of gene orthologous groups and the KEGG pathway indicators. Given a non-redundant CDs sequence, its NCBI taxonomic level is obtained by carrying out BLASTP alignment against the NCBI NR database. The taxonomic level of each non-redundant CDs sequence is determined by a lowest common ancestor (LCA) based algorithm. The taxonomic abundance data for each sample can be computed by counting the indicators of NCBI taxonomic levels. The gene orthologous group indicators and KEGG pathway indicators are assigned by BLASTP alignment of the amino-acid sequences of the predicted CDs against the eggNOG database and the KEGG database.
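The idea behind a lowest common ancestor assignment can be sketched as finding the deepest taxon shared by the lineages of all hits for one sequence. This is a simplified illustration only: the function name and the shortened lineages are assumptions, and the algorithm actually used for the catalog may apply additional rules such as score thresholds.

```python
def lowest_common_ancestor(lineages):
    """Return the deepest taxon shared by all lineages.

    Each lineage is a list of taxa ordered from root to leaf.
    Returns None when the lineages share nothing.
    """
    if not lineages:
        return None
    common = None
    for taxa in zip(*lineages):
        if all(t == taxa[0] for t in taxa):
            common = taxa[0]
        else:
            break
    return common

# Lineages of the hits for one sequence (shortened; real lineages have more ranks).
hit_lineages = [
    ["Bacteria", "Firmicutes", "Clostridia", "Clostridium"],
    ["Bacteria", "Firmicutes", "Clostridia", "Eubacterium"],
    ["Bacteria", "Firmicutes", "Bacilli", "Bacillus"],
]
assigned = lowest_common_ancestor(hit_lineages)
assigned_first_two = lowest_common_ancestor(hit_lineages[:2])
```

When the hits disagree at the genus level, the sequence is assigned to a higher rank, which is why the taxonomic indicators span several levels.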

30 Experimental Data Collection (continued)
NCBI Taxonomic Levels:
- Genus: Clostridium
- Genus: Bacteroides
- Phylum: Firmicutes
- Class: Clostridia
- Genus: Bacillus
Orthologous Group Indicators:
- COG0463: Glycosyltransferases involved in cell wall biogenesis
- COG0642: Signal transduction histidine kinase
- COG1132: ABC-type multidrug transport system, ATPase and permease components
- COG0438: Glycosyltransferase
KEGG Pathway Indicators:
- map00230: Metabolism; Nucleotide Metabolism; Purine metabolism
- map00240: Metabolism; Nucleotide Metabolism; Pyrimidine metabolism
- map00350: Metabolism; Amino Acid Metabolism; Tyrosine metabolism
The union of unique functional elements jointly defines a fixed word vocabulary. In total, there are 647,136 NCBI taxonomic level indicators, with a vocabulary size of 748; there are a total of 1,293,764 gene orthologous group indicators, with a vocabulary size of 4,667; and there are 953,493 KEGG pathway indicators, with a vocabulary size of

31 Groups of functional elements in microbial community
Given the non-redundant CDs catalog and the derived functional elements, we are interested in identifying the frequent co-occurrence patterns of functional elements (a.k.a. functional groups).

32 Generative process of proposed model
Functional elements that are commonly shared across samples may suggest functional similarity and biological relevance among samples. To capture such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the introduction of the background topic $z_0$ in topic modeling.

33 Illustration of the background topic of gene OGs indicators
Background Topic - Indicator of Gene OGs
Table columns: Gene OGs Indicator | Description | Probability (the probability values are not preserved)
- COG0463: Glycosyltransferases involved in cell wall biogenesis
- COG0642: Signal transduction histidine kinase
- COG0582: Integrase
- COG1132: ABC-type multidrug transport system, ATPase and permease components
- COG0438: Glycosyltransferase
- COG0745: Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain
- COG1396: Predicted transcriptional regulators
- COG0577: ABC-type antimicrobial peptide transport system, permease component
- COG2207: AraC-type DNA-binding domain-containing proteins
- COG3250: Beta-galactosidase/beta-glucuronidase

34 Illustration of the background topic of KEGG Pathway Indicators
Background Topic - KEGG Pathway Indicator
Table columns: Pathway Map ID | Description | Probability (the probability values are not preserved)
- map00230: Metabolism; Nucleotide Metabolism; Purine metabolism
- map00051: Metabolism; Carbohydrate Metabolism; Fructose and mannose metabolism
- map00500: Metabolism; Carbohydrate Metabolism; Starch and sucrose metabolism
- map00240: Metabolism; Nucleotide Metabolism; Pyrimidine metabolism
- map00350: Metabolism; Amino Acid Metabolism; Tyrosine metabolism
- map00260: Metabolism; Amino Acid Metabolism; Glycine, serine and threonine metabolism
- map00010: Metabolism; Carbohydrate Metabolism; Glycolysis / Gluconeogenesis
- map00620: Metabolism; Carbohydrate Metabolism; Pyruvate metabolism
- map00251: Metabolism; Amino Acid Metabolism; Glutamate metabolism
- map00550: Metabolism; Glycan Biosynthesis and Metabolism; Peptidoglycan biosynthesis

35 Uncovered latent topics with respect to NCBI taxonomic indicators
Illustration of the most relevant latent topics with respect to different taxa
Table columns: Taxon | Topic ID | MI Score | Topic ID | MI Score | Topic ID | MI Score (topic IDs and MI scores are not preserved)
Table rows: family_enterobacteriaceae; genus_clostridium; genus_bacteroides; phylum_bacteroidetes; phylum_firmicutes
Discoveries: For each taxon, latent topics are sorted with respect to the mutual information score (MI score). The MI serves as a relevance measurement between taxa and latent topics. It shows that phylum Firmicutes is most relevant to the background topic (Topic 0). Similarly, genus Clostridium is most relevant to Topic 50, 153, 95 and genus Bacteroides is most relevant to Topic 156, 77,

36 Uncovered latent topics with respect to NCBI taxonomic indicators
Illustration of top-ranked latent topics with respect to different microbial samples
Table columns, repeated for samples MH0001, O2.UC-1 and V1.CD-1: Topic | p(topic | sample) (the table values are not preserved)
Discoveries: the probability of Topic 0 in Healthy and UC samples (0.475 in MH0001 and in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that for CD samples, the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topic 95 and Topic 52 in samples O2.UC-1 and V1.CD-1 may indicate the existence, and possibly high abundance, of genus Clostridium and genus Bacteroides, respectively.

37 Uncovered latent topics with respect to NCBI taxonomic indicators

38 Summary of Discoveries
Our discoveries are supported by recent findings in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Harry S. et al., 2006], [Manichanh C. et al., 2006], [Walker A. et al., 2011]. It has been reported that there is a significant reduction in the proportion of bacteria belonging to phylum Firmicutes in CD samples, which is consistent with our results. This can be explained by the fact that mucosal microbial diversity is reduced in IBD, particularly in CD, which is associated with bacterial invasion of the mucosa. In UC, the inflammation is typically more superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

39 Conclusions
We have shown that the configuration of functional groups encoded in the gene-expression data of metagenome samples can be inferred by applying probabilistic topic modeling to functional elements derived from the non-redundant CDs catalogue. The latent topics estimated from human gut microbial samples are supported by recent discoveries in fecal microbiota studies, which demonstrates the effectiveness of the proposed method.

40 Future work
In the proposed model, the number of functional groups has to be specified in advance, or tuned iteratively using criteria such as log-likelihood and perplexity. In future work, we propose to use nonparametric hierarchical Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provides the flexibility to model microbial sequences with an unknown number of functional groups.

41 Questions?

42 Backup Slides

43 Mutual Information
After estimating the topic model and assigning a latent topic to each functional element, the relevance between latent topics and functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between the functional element indicators and the latent topics, based on the final latent topic assignments to functional elements:
\[ MI(R_g, Z_t) = \sum_{R_g, Z_t \in \{0, 1\}} p(R_g, Z_t) \log \frac{p(R_g, Z_t)}{p(R_g)\, p(Z_t)} \]
in which $R_g$ and $Z_t$ are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair $(R_g, Z_t)$ indicates whether a latent topic has been assigned to a specific functional element.
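Mutual information between two binary indicators can be computed from a 2x2 table of assignment counts. The sketch below assumes the sum runs over all four combinations of the indicator values; the function name and the argument layout are illustrative.

```python
import math

def mutual_information(n11, n10, n01, n00):
    """MI (in nats) between two binary indicators from a 2x2 count table.

    n11: element is g and topic is t      n10: element is g, topic is not t
    n01: element is not g, topic is t     n00: neither
    """
    total = n11 + n10 + n01 + n00
    cells = (
        (n11, n11 + n10, n11 + n01),
        (n10, n11 + n10, n10 + n00),
        (n01, n01 + n00, n11 + n01),
        (n00, n01 + n00, n10 + n00),
    )
    mi = 0.0
    for count, row_total, col_total in cells:
        if count > 0:  # a zero cell contributes nothing to the sum
            mi += (count / total) * math.log(count * total / (row_total * col_total))
    return mi

mi_independent = mutual_information(25, 25, 25, 25)
mi_dependent = mutual_information(50, 0, 0, 50)
```

Independent indicators give a score of zero, while perfectly coupled indicators with balanced counts give log 2, so larger scores mark topics that track a given element more closely.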

44 Likelihood Comparison
\[ p(\mathbf{w} \mid \mathbf{z}) = \prod_{t=1}^{T} \int p(\mathbf{w}_{z_t} \mid z_t, \phi_{z_t}) \, p(\phi_{z_t}) \, d\phi_{z_t} = \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{t=1}^{T} \frac{\prod_{w_i} \Gamma(n_t^{(w_i)} + \beta)}{\Gamma(n_t^{(\cdot)} + W\beta)} \cdot \frac{\Gamma(W\eta)}{\Gamma(\eta)^{W}} \cdot \frac{\prod_{w_i} \Gamma(n_0^{(w_i)} + \eta)}{\Gamma(n_0^{(\cdot)} + W\eta)} \]

45 Likelihood Comparison (continued)

46 Perplexity Comparison
The perplexity is calculated on held-out testing data. In our experiment, we use a 50% subset of the functional elements as training data and the other 50% as testing data. When constructing the two subsets, we ensure that the functional elements from the same sample are split equally between both subsets. In practice, perplexity is the inverse of the predicted likelihood of the held-out testing data, normalized by the number of elements and computed using parameters inferred from the trained topic model. Thus a smaller perplexity value indicates a better model fit.
\[ \mathrm{perplexity}(D_{test}) = \exp\left\{ - \frac{\sum_{j=1}^{D_{test}} \log p(\mathbf{w}_j)}{\sum_{j=1}^{D_{test}} N_j} \right\} \]
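The perplexity formula can be sketched as follows. The sketch assumes that the likelihood of a held-out sample factorizes over its elements and that the trained model is exposed through a caller-supplied function `word_prob`; both are assumptions made for illustration.

```python
import math

def perplexity(test_docs, word_prob):
    """exp(-sum_j log p(w_j) / sum_j N_j) over held-out documents.

    test_docs: list of documents, each a list of words
    word_prob(doc_index, word): predictive probability of a word in that
    document under the trained model (supplied by the caller)
    """
    log_likelihood = 0.0
    num_words = 0
    for j, doc in enumerate(test_docs):
        for w in doc:
            log_likelihood += math.log(word_prob(j, w))
        num_words += len(doc)
    return math.exp(-log_likelihood / num_words)

# Sanity check: a model that spreads probability uniformly over V words
# has perplexity V.
V = 4
docs = [[0, 1, 2], [3, 3]]
uniform_ppl = perplexity(docs, lambda j, w: 1.0 / V)
```

A trained model should score well below the uniform baseline on held-out data, which is the sense in which a smaller value indicates a better fit.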

47 Perplexity Comparison (continued)

48 Dirichlet Process (DP) as a Non-Parametric Mixture Model
The Dirichlet Process (DP) is defined as a distribution over random probability measures, $G_0 \sim DP(\gamma, H)$, in which $\gamma$ is a concentration parameter and $H$ is a base measure defined on a sample space $\Theta$. By its definition, for any finite measurable partition $\{A_1, \dots, A_r\}$ of $\Theta$:
\[ (G_0(A_1), \dots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \dots, \gamma H(A_r)) \]
The Dirichlet Process can also be constructed by the stick-breaking construction as follows:
\[ G_0 = \sum_{k=1}^{\infty} \beta_k \, \delta_{\theta_k}, \qquad \beta_k = \alpha_k \prod_{i=1}^{k-1} (1 - \alpha_i), \qquad \alpha_k \sim \mathrm{Beta}(1, \gamma) \]
The weights of the mixture components, $\beta = \{\beta_k\}$ ($k = 1, \dots, \infty$), are also referred to as $\beta \sim \mathrm{GEM}(\gamma)$.
Figures: the Dirichlet process by its definition; the Dirichlet process constructed by stick-breaking, in which data sample $x_i$ is drawn from a base distribution with associated parameters $\theta_k$.
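The stick-breaking construction can be simulated by truncating the infinite sequence after a fixed number of components. The truncation level, the function name, and the parameter values below are assumptions made for illustration.

```python
import random

def stick_breaking_weights(gamma, num_components, seed=0):
    """Return the first `num_components` weights beta_k of a GEM(gamma) draw,
    together with the leftover stick length not yet assigned."""
    rng = random.Random(seed)
    weights = []
    remaining = 1.0  # prod_{i<k} (1 - alpha_i)
    for _ in range(num_components):
        a_k = rng.betavariate(1.0, gamma)  # alpha_k ~ Beta(1, gamma)
        weights.append(a_k * remaining)    # beta_k = alpha_k * prod_{i<k} (1 - alpha_i)
        remaining *= 1.0 - a_k
    return weights, remaining

beta_weights, leftover = stick_breaking_weights(gamma=2.0, num_components=10)
```

A larger concentration parameter breaks off smaller pieces on average, so the weight is spread over more components.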

49 Hierarchical Dirichlet Process (HDP)
The Hierarchical Dirichlet Process (HDP) considers $G_0 \sim DP(\gamma, H)$ as a global probability measure across the corpus and defines a set of child random probability measures $G_j \sim DP(\alpha_0, G_0)$, one for each document $j$, which leads to different document-level distributions over semantic mixture components:
\[ (G_j(A_1), \dots, G_j(A_r)) \sim \mathrm{Dirichlet}(\alpha_0 G_0(A_1), \dots, \alpha_0 G_0(A_r)) \]
Each $G_j$ can also be constructed by the stick-breaking construction as
\[ G_j = \sum_{k=1}^{\infty} \pi_{jk} \, \delta_{\theta_k} \]
in which $\pi_j = \{\pi_{jk}\}$ ($k = 1, \dots, \infty$) specifies the weights of mixture component indicator $k$. Substituting the stick-breaking constructions of $G_0$ and $G_j$, it follows that
\[ \left( \sum_{k \in K_1} \pi_{jk}, \dots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \dots, \alpha_0 \sum_{k \in K_r} \beta_k \right) \]
Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it can be shown that
\[ \pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \qquad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k, \; \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right) \]
It then follows that $\pi_j \sim DP(\alpha_0, \beta)$.
Figure: stick-breaking construction of the hierarchical Dirichlet process.
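The document-level weights can be drawn from the Beta construction above once a truncated set of global weights is available. The sketch assumes the global weights sum to less than one, so that every Beta parameter stays positive; the function name and the toy values are illustrative.

```python
import random

def document_weights(beta, alpha0, seed=0):
    """Draw document-level weights pi_j given truncated global weights beta.

    beta: first K global weights with sum(beta) < 1 (the rest of the stick
    belongs to components beyond the truncation).
    """
    rng = random.Random(seed)
    pi = []
    remaining = 1.0        # prod_{l<k} (1 - pi'_jl)
    beta_cumulative = 0.0  # sum_{l<=k} beta_l
    for b in beta:
        beta_cumulative += b
        # pi'_jk ~ Beta(alpha0 * beta_k, alpha0 * (1 - sum_{l<=k} beta_l))
        v = rng.betavariate(alpha0 * b, alpha0 * (1.0 - beta_cumulative))
        pi.append(v * remaining)
        remaining *= 1.0 - v
    return pi, remaining

global_beta = [0.4, 0.3, 0.2]  # toy values; 0.1 of the stick is left over
pi_j, pi_leftover = document_weights(global_beta, alpha0=5.0)
```

Because every document draws its own weights around the same global weights, documents share the set of components while differing in how much each component is used.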
