

1 Inferring Functional Groups from Microbial Gene Catalogue with Probabilistic Topic Models
Xin Chen (1), TingTing He (2), Xiaohua Hu (1), Yuan An (1), Xindong Wu (3)
(1) College of Information Science and Technology, Drexel University, Philadelphia, PA 19104, USA
(2) Dept. of Computer Science, Central China Normal University, Wuhan, China
(3) Department of Computer Science, University of Vermont, Burlington, VT, USA

2 Backgrounds: Genomics
Genomics refers to the analysis of genomes. A genome can be thought of as the complete set of DNA sequences that codes for the hereditary material that is passed on from generation to generation. These DNA sequences include all of the genes (the functional and physical unit of heredity passed from parent to offspring) and transcripts (the RNA copies that are the initial step in decoding the genetic information) included within the genome. Thus, genomics refers to the sequencing and analysis of all of these genomic entities, including genes and transcripts, in an organism.

3 Backgrounds: GenBank and NCBI
In recent years, GenBank and the NCBI databases have grown rapidly with the advancement of gene sequencing technology.

4 Backgrounds: annotating algorithms
As GenBank and NCBI have grown, many annotating algorithms have been developed to match genomic sequences to GenBank/NCBI standard references and to attach meta-information to the sequences.

5 Backgrounds: meta-information
The annotated meta-information involves hierarchical data such as the NCBI Taxonomy and the Gene Ontology.

6 Challenges: Metagenomics
With fast-advancing sequencing techniques, large numbers of sequenced genomes and meta-genomes from uncultured microbial samples have become available. The goal of metagenomics is to study the genome-wide gene-expression data from uncultured environmental samples (such as the ocean, soil and the human body) and to understand the underlying biological processes.

7 Research Questions
What are the major research questions of our study? We use our data mining framework to investigate the following questions:
1) Given a large number of genome fragments from a microbial sample, what genomes are there? Answering this question requires mapping the meta-genomic reads to taxonomic units (usually by homology-based sequence alignment; this task is also known as taxonomic classification or taxonomic analysis).
2) What are the major functions of these genomes? Answering this question involves annotating the major functional units (such as signal transduction, metabolic capacity and gene regulation) at the genome level (a.k.a. functional analysis).
Our research objective: We aim to develop a new method that can analyze the genome-level composition of DNA sequences, in order to characterize the set of common genomic features shared by the same species and to identify their functional roles.

8 Related topics in this presentation:
- Structural annotation and protein encoding regions
- Homology-based functional analysis
- Topic Models

9 Structural annotation and protein encoding regions
Structural annotation: annotating the regions of known open reading frames (ORFs), non-coding genes (rRNA, tRNA, miRNA), promoters and UTRs in the DNA sequences.

10 Structural annotation and protein encoding regions (continued)
NCBI standard reference sequences have detailed structural annotations of both non-protein encoding regions (such as tRNA) and protein encoding regions (CDS), as well as the corresponding gene names (if applicable). The GenBank accession number of each reference sequence is available in each NCBI online query.

11 Related topics in this presentation:
- Structural annotation and protein encoding regions
- Homology-based functional analysis
- Topic Models

12 Functional analysis - overview
Functional analysis uncovers the major gene functions related to the genomic sequences. It requires explaining the biochemical activity (a.k.a. molecular function) of the gene product and identifying the biological process to which the gene or gene product contributes (including information about the enzymes, pathways and metabolic capabilities related to the gene).

13 Homology-based functional analysis (Richter and Huson, 2009)
A homology-based approach has recently been introduced to achieve functional annotation of metagenomic reads (Richter and Huson, 2009). The framework begins with the homology-based BLASTX algorithm, which matches the metagenomic fragments against the reference sequences in the NCBI database. The BLASTX hits associate fragments with related protein IDs and gene names. The Gene Ontology (GO) database is then used to map the associated gene names to the corresponding GO terms, which provides an overview of gene functions and products for the metagenomic fragments.
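The annotation pipeline described on this slide can be sketched roughly as follows. This is only an illustration: the hit tuples, the `GENE_TO_GO` lookup table, the placeholder GO identifiers and the function name `annotate_fragments` are all hypothetical, and no real BLASTX or Gene Ontology API is called.

```python
from collections import defaultdict

# Hypothetical BLASTX-style hits: (fragment_id, protein_id, gene_name, e_value).
HITS = [
    ("frag_1", "prot_A", "gene_a", 1e-40),
    ("frag_1", "prot_B", "gene_b", 1e-12),
    ("frag_2", "prot_C", "gene_c", 0.5),
]

# Hypothetical gene-name -> GO-term lookup table (placeholder identifiers).
GENE_TO_GO = {
    "gene_a": ["GO:term_1", "GO:term_2"],
    "gene_b": ["GO:term_2"],
    "gene_c": ["GO:term_3"],
}


def annotate_fragments(hits, gene_to_go, max_evalue=1e-5):
    """Collect the GO terms of every hit whose E-value passes the threshold."""
    annotations = defaultdict(set)
    for fragment_id, _protein_id, gene_name, e_value in hits:
        if e_value > max_evalue:
            continue  # hit is too weak to be trusted
        annotations[fragment_id].update(gene_to_go.get(gene_name, []))
    return {frag: sorted(terms) for frag, terms in annotations.items()}


go_by_fragment = annotate_fragments(HITS, GENE_TO_GO)
print(go_by_fragment)
```

In this toy input, `frag_2` has no hit below the E-value threshold and therefore receives no annotation at all, while `frag_1` receives the union of the terms of all its hits with no ranking among them.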

14 Homology-based functional analysis (Richter and Huson, 2009)
Figure: GO terms obtained from database identifier mapping (Richter and Huson, 2009).

15 Limitations with Homology-based Functional Analysis Methods
1. Homology-based approaches rely heavily on the result of local sequence alignment (such as BLAST and BLASTX) to the known open reading frames (ORFs). A BLAST-like local alignment may return hundreds of hits or no hits at all, depending on the E-value threshold used. In the latter case, the current methods are unable to provide any functional annotation. In the former case, there is usually no proper tie-breaker to further reduce the hits, which makes the functional annotation ambiguous (with hundreds of probable explanations).
2. The homology-based functional annotation methods do not provide any insight into the major functional capabilities of genomes (such as which gene functions are more commonly shared by strains of the same species), because there is no priority among the annotated GO terms.

16 Related topics in this presentation:
- Structural annotation and protein encoding regions
- Homology-based functional analysis
- Topic Models

17 Topic Modeling - Intuition
Example document:
"Of all the sensory impressions proceeding to the brain, the visual experiences are the dominant ones. Our perception of the world around us is based essentially on the messages that reach the brain from our eyes. For a long time it was thought that the retinal image was transmitted point by point to visual centers in the brain; the cerebral cortex was a movie screen, so to speak, upon which the image in the eye was projected. Through the discoveries of Hubel and Wiesel we now know that behind the origin of the visual perception in the brain there is a considerably more complicated course of events. By following the visual impulses along their path to the various cell layers of the optical cortex, Hubel and Wiesel have been able to demonstrate that the message about the image falling on the retina undergoes a stepwise analysis in a system of nerve cells stored in columns. In this system each cell has its specific function and is responsible for a specific detail in the pattern of the retinal image."
Keywords highlighted in the document: sensory, brain, visual, perception, retinal, cerebral cortex, eye, cell, optical nerve, image, Hubel, Wiesel.
- Assume the data we see is generated by some parameterized random process.
- Learn the parameters that best explain the data.
- Use the model to predict (infer) new data, based on the data seen so far.

18 Notations
- Word: the basic unit; an item from a vocabulary indexed by {1,...,V}.
- Document: a sequence of N words, denoted by w = (w_1, w_2, ..., w_N).
- Collection: a total of D documents, denoted by C = {w_1, w_2, ..., w_D}.
- Topic: denoted by z; the total number of topics is K. Each topic has its own word distribution p(w | z).

19 Background & Existing Techniques of Generative Latent Topic Models
The Naïve Bayesian model:
$$z^{*} = \arg\max_{z}\, p(z \mid w) = \arg\max_{z}\, p(z)\, p(w \mid z)$$
in which $z^{*}$ is the word-topic decision, $p(z)$ is the prior probability of topic z, and $p(w \mid z)$ is the likelihood of word w given topic z.
The probabilistic latent semantic indexing (PLSI) model (Hofmann, 2001):
Assumption: each document has a mixture of K topics.
Fitting the model involves estimating the topic-specific word distributions $p(w_i \mid z_k)$ and the document-specific topic distributions $p(z_k \mid d_j)$ from the corpus via maximum likelihood estimation (MLE).
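As a minimal sketch of the Naïve Bayesian word-topic decision on this slide, the snippet below picks the topic that maximizes p(z) p(w | z). The topic names, the probability tables and the function name `most_likely_topic` are made up for illustration only.

```python
def most_likely_topic(word, prior, likelihood):
    """Return argmax_z p(z) * p(word | z) over the topics in `prior`."""
    scores = {z: prior[z] * likelihood[z].get(word, 0.0) for z in prior}
    return max(scores, key=scores.get)


# Illustrative (made-up) prior p(z) and per-topic word likelihoods p(w | z).
prior = {"topic_a": 0.6, "topic_b": 0.4}
likelihood = {
    "topic_a": {"gene": 0.1, "cell": 0.9},
    "topic_b": {"gene": 0.7, "cell": 0.3},
}

best = most_likely_topic("gene", prior, likelihood)
print(best)
```

Here "gene" scores 0.6 * 0.1 under topic_a and 0.4 * 0.7 under topic_b, so the decision goes to topic_b even though topic_a has the larger prior.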

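20 Latent Dirichlet Allocation (LDA) Model (Blei, 2003)
$$\varphi^{(j)} \sim \mathrm{Dir}(\beta), \qquad \theta^{(d)} \sim \mathrm{Dir}(\alpha)$$
$$p(z \mid d) \sim \mathrm{Multi}(\theta^{(d)}), \qquad p(w \mid z = j) \sim \mathrm{Multi}(\varphi^{(j)})$$
$$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto \frac{\beta + n^{(w_i)}_{-i,j}}{W\beta + n^{(\cdot)}_{-i,j}} \cdot \frac{\alpha + n^{(d)}_{-i,j}}{T\alpha + n^{(d)}_{-i,\cdot}}$$
Here W denotes the vocabulary size (V in the notation slide) and T the number of topics (K in the notation slide).
In the PLSI model, the topic mixture probabilities p(z_k | d_j) for documents are fixed once the model is estimated. For a newly arriving document, the model needs to be re-estimated, so it is not scalable. The LDA model treats the probability of latent topics for each document, p(z | d), and the conditional probability of words for each latent topic, p(w | z), as latent random variables that are subject to change when a new document arrives.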
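The generative assumptions of LDA stated on this slide can be sketched as a small simulation. This is an illustrative sketch using only the Python standard library: the function names, the symmetric scalar priors and the corpus sizes are assumptions, and Dirichlet draws are obtained by normalizing Gamma draws.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Generate word-index documents following the LDA generative process."""
    rng = random.Random(seed)
    # phi^(j) ~ Dir(beta): one word distribution per topic.
    phi = [sample_dirichlet(beta, vocab_size, rng) for _ in range(n_topics)]
    docs = []
    for _ in range(n_docs):
        # theta^(d) ~ Dir(alpha): topic mixture of this document.
        theta = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(doc_len):
            z = rng.choices(range(n_topics), weights=theta)[0]   # z ~ Multi(theta)
            w = rng.choices(range(vocab_size), weights=phi[z])[0]  # w ~ Multi(phi^(z))
            words.append(w)
        docs.append(words)
    return docs, phi


docs, phi = generate_corpus(n_docs=3, doc_len=5, n_topics=2, vocab_size=6,
                            alpha=0.5, beta=0.5)
print(docs)
```

Model estimation runs this process in reverse: given only the observed words, it recovers plausible values for the hidden topic assignments and distributions.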

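21 LDA Model Estimation - Gibbs Sampling Monte Carlo process (Griffiths, 2004)
Probability of a topic being assigned to a word given the other observations:
$$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\; p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})$$
The first factor:
$$p(w_i \mid z_{w_i} = j, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(w_i \mid z_{w_i} = j, \varphi^{(j)}, \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\; p(\varphi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\varphi^{(j)} = \frac{\beta + n^{(w_i)}_{-i,j}}{W\beta + n^{(\cdot)}_{-i,j}}$$
in which
$$p(w_i \mid z_{w_i} = j, \varphi^{(j)}, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \varphi^{(j)}_{w_i}, \qquad p(\varphi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \varphi^{(j)})\; p(\varphi^{(j)}).$$
Since $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \varphi^{(j)}) \sim \mathrm{Multi}(\varphi^{(j)})$ and $p(\varphi^{(j)}) \sim \mathrm{Dir}(\beta)$, it follows that
$$p(\varphi^{(j)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\beta + n^{(w_i)}_{-i,j}).$$
The second factor:
$$p(z_{w_i} = j \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) = \int p(z_{w_i} = j \mid \theta^{(d)})\; p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i})\, d\theta^{(d)} = \frac{\alpha + n^{(d)}_{-i,j}}{T\alpha + n^{(d)}_{-i,\cdot}}$$
in which
$$p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \propto p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta^{(d)})\; p(\theta^{(d)}).$$
Since $p(\mathbf{w}_{-i}, \mathbf{z}_{-w_i} \mid \theta^{(d)}) \sim \mathrm{Multi}(\theta^{(d)})$ and $p(\theta^{(d)}) \sim \mathrm{Dir}(\alpha)$, we have
$$p(\theta^{(d)} \mid \mathbf{w}_{-i}, \mathbf{z}_{-w_i}) \sim \mathrm{Dir}(\alpha + n^{(d)}_{-i,j}).$$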
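A compact collapsed Gibbs sampler built on the two count ratios derived on this slide might look like the following. It is a sketch under simplifying assumptions (symmetric scalar priors, documents given as lists of word indices, a fixed number of sweeps); the function and variable names are illustrative and this is not the authors' implementation.

```python
import random


def gibbs_lda(docs, n_topics, vocab_size, alpha, beta, n_iters=20, seed=0):
    """Collapsed Gibbs sampling for LDA over documents of word indices."""
    rng = random.Random(seed)
    n_wt = [[0] * n_topics for _ in range(vocab_size)]  # n^{(w)}_j: word-topic counts
    n_t = [0] * n_topics                                # n^{(.)}_j: words per topic
    n_dt = [[0] * n_topics for _ in docs]               # n^{(d)}_j: doc-topic counts
    z = []
    # Random initial topic assignment for every word position.
    for d, doc in enumerate(docs):
        z.append([])
        for w in doc:
            j = rng.randrange(n_topics)
            z[d].append(j)
            n_wt[w][j] += 1
            n_t[j] += 1
            n_dt[d][j] += 1
    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment to obtain the "-i" counts.
                j = z[d][i]
                n_wt[w][j] -= 1
                n_t[j] -= 1
                n_dt[d][j] -= 1
                doc_total = len(doc) - 1  # n^{(d)}_{-i,.}
                weights = [
                    (beta + n_wt[w][k]) / (vocab_size * beta + n_t[k])
                    * (alpha + n_dt[d][k]) / (n_topics * alpha + doc_total)
                    for k in range(n_topics)
                ]
                # Draw the new topic in proportion to the conditional probability.
                j = rng.choices(range(n_topics), weights=weights)[0]
                z[d][i] = j
                n_wt[w][j] += 1
                n_t[j] += 1
                n_dt[d][j] += 1
    return z, n_wt, n_dt


docs = [[0, 0, 1, 1], [2, 3, 3, 2], [0, 1, 2, 3]]
z, n_wt, n_dt = gibbs_lda(docs, n_topics=2, vocab_size=4, alpha=0.5, beta=0.1)
print(z)
```

The document-side denominator is the same for every topic, so it could be dropped before normalization; it is kept here only to mirror the formula term by term.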

22 Monte Carlo process
Given the word-topic posterior probability, the Monte Carlo process becomes straightforward. It is similar to throwing a die (given the probability of each face appearing) to determine the assignment of a topic to each word for the next round.
Given the probability for each word:
$$p(z_{w_i} = j \mid w_i, \mathbf{w}_{-i}, \mathbf{z}_{-w_i}), \quad j = 1, \dots, K$$
the process draws a new topic assignment for each word.
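The "throwing a die" step can be sketched as inverse-CDF sampling from the per-topic probabilities. The function name `throw_die` and the example weights are illustrative assumptions.

```python
import random


def throw_die(probs, rng):
    """Return index j with probability proportional to probs[j]."""
    u = rng.random() * sum(probs)  # uniform point on the cumulative scale
    cumulative = 0.0
    for j, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return j
    return len(probs) - 1  # guard against floating-point round-off


rng = random.Random(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[throw_die([0.2, 0.5, 0.3], rng)] += 1
print(counts)
```

The weights do not need to be normalized, which is convenient because the word-topic posterior is only known up to a proportionality constant.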

23 Statistical relationships of words and topics

24 An example of topic assignment to words

25 Experiments

26 Experiment: Inferring Functional Groups from Microbial Gene Catalogue with Topic Models
In our experiment, based on the functional elements derived from the non-redundant CDs catalogue, we show that the configuration of functional groups in meta-genome samples can be inferred by probabilistic topic modeling. Probabilistic topic modeling is a Bayesian method that can extract useful topical information from unlabeled data. When it is used to study microbial samples, the functional elements (including taxonomic levels, indicators of gene orthologous groups and KEGG pathway mappings) are analogous to words. Estimating the probabilistic topic model can uncover the configuration of functional groups (the latent topics) in each sample, which may be further used to study the genotype-phenotype connection of human disease.

27 Experimental Data Collection
We conduct a probabilistic topic modeling experiment to identify functional groups from the human gut microbial community data generated by [Qin, et al. 2010], which is openly accessible. The human gut microbial samples from [Qin, et al. 2010] belong to both healthy subjects (HS) and patients with inflammatory bowel disease (IBD). Specifically, the IBD patients are from two different groups, one with Crohn's disease (CD) and the other with ulcerative colitis (UC). In total, there are 85 healthy samples, 15 UC samples and 12 CD samples.

28 Experimental Data Collection (continued)
According to [Qin, et al. 2010], the Illumina GA reads from human gut microbial samples are first assembled into longer contigs. The Glimmer program is then used to predict protein-encoding sequences (CDs) from the assembled contigs. The predicted CDs sequences are aligned to each other to form a non-redundant CDs catalog (a.k.a. the minimal gut genome). The non-redundant CDs catalog consists of 3,299,822 non-redundant CDs sequences with an average length of 704 bp.
Example catalog entry:
CDs_id: MH0001
Name: GL _ MH0001 _[Lack_3'-end]_[mRNA]_locus=scaffold96_9:1:1206:-
Length: 1206
COG/KO: COG4799 K01966
Pathway mapping: map00280, map00640
Taxonomic level: species - Eubacterium eligens

29 Experimental Data Collection (continued)
In our experiment, three types of functional elements are derived from the non-redundant CDs catalog: the NCBI taxonomic level indicators, the indicators of gene orthologous groups and the KEGG pathway indicators. Given a non-redundant CDs sequence, its NCBI taxonomic level is obtained by carrying out a BLASTP alignment against the NCBI NR database. The taxonomic level of each non-redundant CDs sequence is determined by a lowest common ancestor (LCA) based algorithm. The taxonomic abundance data for each sample can be computed by counting the indicators of NCBI taxonomic levels. The gene orthologous group indicators and KEGG pathway indicators are assigned by BLASTP alignment of the amino-acid sequences of the predicted CDs against the eggNOG database and the KEGG database.
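The lowest-common-ancestor assignment and the counting of taxonomic indicators can be illustrated as follows. This is a simplified sketch: each hit is assumed to come with a root-to-leaf lineage of (rank, name) pairs, the lineages are hand-written rather than taken from BLASTP output, and the `rank_name` indicator format is an assumption.

```python
from collections import Counter


def lowest_common_ancestor(lineages):
    """Return the deepest (rank, name) node shared by all root-to-leaf lineages."""
    lca = None
    for nodes in zip(*lineages):
        if all(node == nodes[0] for node in nodes):
            lca = nodes[0]
        else:
            break
    return lca


# Hypothetical lineages of two alignment hits for one CDs sequence.
hit_lineages = [
    [("phylum", "Firmicutes"), ("class", "Clostridia"), ("genus", "Clostridium")],
    [("phylum", "Firmicutes"), ("class", "Clostridia"), ("genus", "Eubacterium")],
]
assigned = lowest_common_ancestor(hit_lineages)

# Taxonomic abundance of a sample: count the indicators assigned to its sequences.
sample_assignments = [assigned, ("genus", "Bacteroides"), assigned]
abundance = Counter(f"{rank}_{name.lower()}" for rank, name in sample_assignments)
print(assigned, dict(abundance))
```

Because the two hits disagree at the genus level, the sequence is assigned to the class they share, which is how conflicting hits end up at higher taxonomic levels.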

30 Experimental Data Collection (continued)
Examples of functional elements:
NCBI Taxonomic Levels
  Genus   Clostridium
  Genus   Bacteroides
  Phylum  Firmicutes
  Class   Clostridia
  Genus   Bacillus
Orthologous Group Indicators
  COG0463 : Glycosyltransferases involved in cell wall biogenesis
  COG0642 : Signal transduction histidine kinase
  COG1132 : ABC-type multidrug transport system, ATPase and permease components
  COG0438 : Glycosyltransferase
KEGG Pathway Indicators
  map00230 : Metabolism_Nucleotide Metabolism_Purine metabolism
  map00240 : Metabolism_Nucleotide Metabolism_Pyrimidine metabolism
  map00350 : Metabolism_Amino Acid Metabolism_Tyrosine metabolism
The union of unique functional elements jointly defines a fixed word vocabulary. In total, there are 647,136 NCBI taxonomic level indicators, with a vocabulary size of 748; there are 1,293,764 gene orthologous group indicators, with a vocabulary size of 4667; and there are 953,493 KEGG pathway indicators, with a vocabulary size of …

31 Groups of functional elements in microbial community
Given the non-redundant CDs catalog and the derived functional elements, we are interested in identifying the frequent co-occurrence patterns of functional elements (a.k.a. functional groups).

32 Generative process of proposed model
Functional elements that are commonly shared across samples may suggest functional similarity and biological relevance among samples. To capture such information, a genome-wide background distribution of functional elements needs to be estimated, which leads to the introduction of the background topic z_0 in topic modeling.

33 Illustration of the background topic of gene OGs indicators
Background Topic - Indicator of Gene OGs (the probability values of the original table are not preserved)
Gene OGs Indicator | Description
COG0463 | Glycosyltransferases involved in cell wall biogenesis
COG0642 | Signal transduction histidine kinase
COG0582 | Integrase
COG1132 | ABC-type multidrug transport system, ATPase and permease components
COG0438 | Glycosyltransferase
COG0745 | Response regulators consisting of a CheY-like receiver domain and a winged-helix DNA-binding domain
COG1396 | Predicted transcriptional regulators
COG0577 | ABC-type antimicrobial peptide transport system, permease component
COG2207 | AraC-type DNA-binding domain-containing proteins
COG3250 | Beta-galactosidase/beta-glucuronidase

34 Illustration of the background topic of KEGG Pathway Indicators
Background Topic - KEGG Pathway Indicator (the probability values of the original table are not preserved)
Pathway Map ID | Description
map00230 | Metabolism_Nucleotide Metabolism_Purine metabolism
map00051 | Metabolism_Carbohydrate Metabolism_Fructose and mannose metabolism
map00500 | Metabolism_Carbohydrate Metabolism_Starch and sucrose metabolism
map00240 | Metabolism_Nucleotide Metabolism_Pyrimidine metabolism
map00350 | Metabolism_Amino Acid Metabolism_Tyrosine metabolism
map00260 | Metabolism_Amino Acid Metabolism_Glycine, serine and threonine metabolism
map00010 | Metabolism_Carbohydrate Metabolism_Glycolysis / Gluconeogenesis
map00620 | Metabolism_Carbohydrate Metabolism_Pyruvate metabolism
map00251 | Metabolism_Amino Acid Metabolism_Glutamate metabolism
map00550 | Metabolism_Glycan Biosynthesis and Metabolism_Peptidoglycan biosynthesis

35 Uncovered latent topics with respect to NCBI taxonomic indicators
Table: the most relevant latent topics with respect to different taxa. For each taxon, the three top-ranked topics are listed with their Topic ID and MI Score (values not preserved). Taxa shown: family_enterobacteriaceae, genus_clostridium, genus_bacteroides, phylum_bacteroidetes, phylum_firmicutes.
Discoveries: For each taxon, latent topics are sorted with respect to the mutual information score (MI score). The MI score serves as a relevance measure between taxa and latent topics. It shows that phylum Firmicutes is most relevant to the background topic (Topic 0). Similarly, genus Clostridium is most relevant to Topics 50, 153 and 95, and genus Bacteroides is most relevant to Topics 156, 77, …

36 Uncovered latent topics with respect to NCBI taxonomic indicators
Table: top-ranked latent topics with respect to different microbial samples, listing p(topic | sample) for samples MH0001, O2.UC-1 and V1.CD-1 (values not preserved).
Discoveries: The probability of Topic 0 in healthy and UC samples (0.475 in MH0001 and … in O2.UC-1) is much higher than that in CD samples (0.286 in V1.CD-1). This suggests that in CD samples the proportion of bacteria belonging to phylum Firmicutes is significantly reduced. The prevalence of Topics 95 and 52 in samples O2.UC-1 and V1.CD-1 may indicate the existence, and possibly high abundance, of genus Clostridium and genus Bacteroides, respectively.

37 Uncovered latent topics with respect to NCBI taxonomic indicators

38 Summary of Discoveries
Our discoveries are supported by recent findings in fecal microbiota studies of inflammatory bowel disease (IBD) patients [Gerber, 2007], [Harry S. et al., 2006], [Manichanh C. et al., 2006], [Walker A. et al., 2011]. It has been reported that there is a significant reduction in the proportion of bacteria belonging to phylum Firmicutes in CD samples, which is consistent with our results. This can be explained by the fact that mucosal microbial diversity is reduced in IBD, particularly in CD, which is associated with bacterial invasion of the mucosa. In UC, the inflammation is typically more superficial; therefore, the reduction of phylum Firmicutes in UC is not significant.

39 Conclusions
We have shown that the configuration of functional groups encoded in the gene-expression data of meta-genome samples can be inferred by applying probabilistic topic modeling to the functional elements derived from the non-redundant CDs catalogue. The latent topics estimated from human gut microbial samples are supported by recent discoveries in fecal microbiota studies, which demonstrates the effectiveness of the proposed method.

40 Future work
In the proposed model, the number of functional groups has to be specified in advance or iteratively tuned by criteria such as log-likelihood and perplexity. In future work, we propose to use nonparametric hierarchical Bayesian models (such as the HDP model) to handle the uncertainty in the number of functional groups, which provides the flexibility to model microbial sequences with an unknown number of functional groups.

41 Questions?

42 Backup Slides

43 Mutual Information
After estimating the topic model and assigning a latent topic to each functional element, the relevance between latent topics and functional element indicators (i.e. NCBI taxonomic level indicators, indicators of gene orthologous groups and KEGG pathway indicators) can be obtained by calculating the mutual information (MI) between the functional element indicators and the obtained latent topics, based on the final latent topic assignments to functional elements:
$$MI(R_g, Z_t) = \sum_{R_g, Z_t \in \{0,1\}} p(R_g, Z_t)\, \log \frac{p(R_g, Z_t)}{p(R_g)\, p(Z_t)}$$
in which R_g and Z_t are binary indicator variables corresponding to the functional element and the latent topic, respectively. The variable pair (R_g, Z_t) indicates whether a latent topic has been assigned to a specific functional element.
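As an illustration of this score, the sketch below estimates the mutual information between one functional element and one topic from a list of final (element, topic) assignments. The input format, the example indicators and the function name are assumptions; probabilities are simple relative frequencies.

```python
import math


def mutual_information(assignments, element, topic):
    """MI between the binary indicators R_g (element match) and Z_t (topic match)."""
    n = len(assignments)
    joint = {(r, z): 0 for r in (0, 1) for z in (0, 1)}
    for e, t in assignments:
        joint[(int(e == element), int(t == topic))] += 1
    mi = 0.0
    for (r, z), count in joint.items():
        if count == 0:
            continue  # a zero-probability cell contributes nothing
        p_rz = count / n
        p_r = (joint[(r, 0)] + joint[(r, 1)]) / n
        p_z = (joint[(0, z)] + joint[(1, z)]) / n
        mi += p_rz * math.log(p_rz / (p_r * p_z))
    return mi


# Toy assignments: the first element always lands in topic 0, the second in topic 1.
dependent = [("genus_clostridium", 0), ("genus_clostridium", 0),
             ("genus_bacteroides", 1), ("genus_bacteroides", 1)]
score = mutual_information(dependent, "genus_clostridium", 0)
print(score)
```

A perfectly dependent pair such as this one reaches log 2 (in nats), while an element spread evenly over the topics scores 0.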

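44 Likelihood Comparison
$$p(\mathbf{w} \mid \mathbf{z}) = \prod_{t=0}^{T} \int p(\mathbf{w}_{z_t} \mid z_t, \varphi_{z_t})\; p(\varphi_{z_t})\, d\varphi_{z_t}$$
$$= \left( \frac{\Gamma(W\beta)}{\Gamma(\beta)^{W}} \right)^{T} \prod_{t=1}^{T} \frac{\prod_{w_i} \Gamma\!\left(n_t^{(w_i)} + \beta\right)}{\Gamma\!\left(n_t^{(\cdot)} + W\beta\right)} \cdot \frac{\Gamma(W\eta)}{\Gamma(\eta)^{W}} \cdot \frac{\prod_{w_i} \Gamma\!\left(n_0^{(w_i)} + \eta\right)}{\Gamma\!\left(n_0^{(\cdot)} + W\eta\right)}$$
in which topics t = 1, ..., T have the Dirichlet prior β and topic 0 is the background topic with the Dirichlet prior η.

45 Likelihood Comparison (continued)
The likelihood expression is the same as on slide 44.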

46 Perplexity Comparison
Perplexity is calculated on held-out testing data. In our experiment, we use a 50% subset of the functional elements as training data and the other 50% as testing data. When constructing the two subsets, we ensure that the functional elements from each sample are split equally between the two subsets. In practice, perplexity is the inverse of the geometric-mean per-element likelihood that the model predicts for the held-out testing data, using parameters inferred from the trained topic model. Thus a smaller perplexity value indicates a better model fit.
$$\mathrm{perplexity}(D_{test}) = \exp\left\{ - \frac{\sum_{j=1}^{|D_{test}|} \log p(\mathbf{w}_j)}{\sum_{j=1}^{|D_{test}|} N_j} \right\}$$
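A small worked sketch of the perplexity formula follows. It assumes that each held-out sample already has an inferred topic mixture `theta` and that the topic-word distributions `phi` come from the trained model; both are made-up values here, and the function names are illustrative.

```python
import math


def doc_log_likelihood(doc, theta, phi):
    """log p(w_j) = sum over words of log sum_k theta[k] * phi[k][w]."""
    return sum(
        math.log(sum(theta[k] * phi[k][w] for k in range(len(theta))))
        for w in doc
    )


def perplexity(test_docs, thetas, phi):
    """exp(- sum_j log p(w_j) / sum_j N_j) over the held-out documents."""
    total_log_lik = sum(
        doc_log_likelihood(doc, theta, phi) for doc, theta in zip(test_docs, thetas)
    )
    total_words = sum(len(doc) for doc in test_docs)
    return math.exp(-total_log_lik / total_words)


# Two topics that are both uniform over a 4-word vocabulary (illustrative values).
phi = [[0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25]]
test_docs = [[0, 1, 2], [3]]
thetas = [[0.5, 0.5], [0.5, 0.5]]
value = perplexity(test_docs, thetas, phi)
print(value)
```

With uniform word distributions every word has probability 1/4, so the perplexity equals the vocabulary size, 4; a model that predicts the held-out words better scores lower.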

47 Perplexity Comparison (continued)

48 Dirichlet Process (DP) as a Non-Parametric Mixture Model
The Dirichlet Process (DP) is defined as a distribution over random probability measures, $G_0 \sim \mathrm{DP}(\gamma, H)$, in which γ is a concentration parameter and H is a base measure defined on a sample space Θ. By its definition, for any finite measurable partition $\{A_1, \dots, A_r\}$ of Θ:
$$(G_0(A_1), \dots, G_0(A_r)) \sim \mathrm{Dirichlet}(\gamma H(A_1), \dots, \gamma H(A_r)).$$
The Dirichlet process can also be constructed by the stick-breaking construction as follows:
$$G_0 = \sum_{k=1}^{\infty} \beta_k\, \delta_{\theta_k}, \qquad \beta_k = \alpha_k \prod_{i=1}^{k-1} (1 - \alpha_i), \qquad \alpha_k \sim \mathrm{Beta}(1, \gamma).$$
The weights of the mixture components, $\beta = \{\beta_k\}_{k=1}^{\infty}$, are also referred to as $\beta \sim \mathrm{GEM}(\gamma)$.
Figures: the Dirichlet process by its definition; the Dirichlet process constructed by the stick-breaking construction, in which each data sample x_i is drawn from a base distribution with associated parameters θ_k.
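The stick-breaking construction can be simulated with a finite truncation, as in the sketch below. The truncation level `n_sticks`, the function name and the value of γ are assumptions made for illustration; the true construction has infinitely many components.

```python
import random


def stick_breaking(gamma, n_sticks, rng):
    """Truncated stick-breaking: beta_k = alpha_k * prod_{i<k} (1 - alpha_i)."""
    weights = []
    remaining = 1.0  # length of the stick that has not been broken off yet
    for _ in range(n_sticks):
        alpha_k = rng.betavariate(1.0, gamma)  # alpha_k ~ Beta(1, gamma)
        weights.append(alpha_k * remaining)
        remaining *= 1.0 - alpha_k
    return weights, remaining


weights, leftover = stick_breaking(gamma=2.0, n_sticks=10, rng=random.Random(0))
print([round(w, 3) for w in weights], round(leftover, 3))
```

The returned weights plus the leftover stick length always sum to one; a larger γ tends to spread the mass over more components, which is how the concentration parameter controls the effective number of mixture components.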

49 Hierarchical Dirichlet Process (HDP)
The Hierarchical Dirichlet Process (HDP) considers $G_0 \sim \mathrm{DP}(\gamma, H)$ as a global probability measure across the corpus and defines a set of child random probability measures $G_j \sim \mathrm{DP}(\alpha_0, G_0)$, one for each document j, which leads to different document-level distributions over the semantic mixture components:
$$(G_j(A_1), \dots, G_j(A_r)) \sim \mathrm{Dirichlet}(\alpha_0 G_0(A_1), \dots, \alpha_0 G_0(A_r)).$$
Each G_j can also be constructed by the stick-breaking construction as
$$G_j = \sum_{k=1}^{\infty} \pi_{jk}\, \delta_{\theta_k},$$
in which $\pi_j = \{\pi_{jk}\}_{k=1}^{\infty}$ specifies the weights of the mixture component indicator k. Substituting the stick-breaking constructions of G_0 and G_j, it follows that
$$\left( \sum_{k \in K_1} \pi_{jk}, \dots, \sum_{k \in K_r} \pi_{jk} \right) \sim \mathrm{Dirichlet}\left( \alpha_0 \sum_{k \in K_1} \beta_k, \dots, \alpha_0 \sum_{k \in K_r} \beta_k \right).$$
Based on the aggregation properties of the Dirichlet distribution and its connection with the Beta distribution, it can be shown that
$$\pi_{jk} = \pi'_{jk} \prod_{l=1}^{k-1} (1 - \pi'_{jl}), \qquad \pi'_{jk} \sim \mathrm{Beta}\left( \alpha_0 \beta_k,\; \alpha_0 \left( 1 - \sum_{l=1}^{k} \beta_l \right) \right).$$
It then follows that $\pi_j \sim \mathrm{DP}(\alpha_0, \beta)$.
Figure: stick-breaking construction of the hierarchical Dirichlet process.
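The document-level weights π_j can be drawn from given global weights β with the Beta-based recursion on this slide. The sketch below uses a truncated, hand-picked β that leaves some mass unassigned; the values, the function name and the small numerical floor on the second Beta parameter are assumptions for illustration.

```python
import random


def sample_document_weights(beta, alpha0, rng):
    """Draw pi_j from truncated global weights beta via the HDP stick-breaking recursion."""
    pi = []
    remaining = 1.0        # prod_{l<k} (1 - pi'_jl)
    beta_cumulative = 0.0  # sum_{l<=k} beta_l
    for beta_k in beta:
        beta_cumulative += beta_k
        # Second Beta parameter: alpha0 * (1 - sum_{l<=k} beta_l), floored to stay positive.
        b = max(alpha0 * (1.0 - beta_cumulative), 1e-12)
        pi_prime = rng.betavariate(alpha0 * beta_k, b)
        pi.append(pi_prime * remaining)
        remaining *= 1.0 - pi_prime
    return pi


beta = [0.5, 0.3, 0.15]  # truncated global weights; 0.05 of the mass is left over
pi_j = sample_document_weights(beta, alpha0=5.0, rng=random.Random(0))
print([round(p, 3) for p in pi_j])
```

Each document gets its own weights over the same shared components, scattered around β; a larger α_0 keeps the document-level weights closer to the global ones.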


More information

Identifying Bacterial Strains with Sequencing Data using Probabilistic Models

Identifying Bacterial Strains with Sequencing Data using Probabilistic Models Identifying Bacterial Strains with Sequencing Data using Probabilistic Models Helsinki Institute for Information Technology Department of Computer Science, University of Helsinki September 25, 2014 Motivation

More information

Taxonomical Classification using:

Taxonomical Classification using: Taxonomical Classification using: Extracting ecological signal from noise: introduction to tools for the analysis of NGS data from microbial communities Bergen, April 19-20 2012 INTRODUCTION Taxonomical

More information

Assigning Taxonomy to Marker Genes. Susan Huse Brown University August 7, 2014

Assigning Taxonomy to Marker Genes. Susan Huse Brown University August 7, 2014 Assigning Taxonomy to Marker Genes Susan Huse Brown University August 7, 2014 In a nutshell Taxonomy is assigned by comparing your DNA sequences against a database of DNA sequences from known taxa Marker

More information

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution

Taxonomy. Content. How to determine & classify a species. Phylogeny and evolution Taxonomy Content Why Taxonomy? How to determine & classify a species Domains versus Kingdoms Phylogeny and evolution Why Taxonomy? Classification Arrangement in groups or taxa (taxon = group) Nomenclature

More information

Replicated Softmax: an Undirected Topic Model. Stephen Turner

Replicated Softmax: an Undirected Topic Model. Stephen Turner Replicated Softmax: an Undirected Topic Model Stephen Turner 1. Introduction 2. Replicated Softmax: A Generative Model of Word Counts 3. Evaluating Replicated Softmax as a Generative Model 4. Experimental

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION ARTICLE NUMBER: 16161 DOI: 10.1038/NMICROBIOL.2016.161 A reference gene catalogue of the pig gut microbiome Liang Xiao 1, Jordi Estellé 2, Pia Kiilerich 3, Yuliaxis Ramayo-Caldas

More information

Text Mining for Economics and Finance Latent Dirichlet Allocation

Text Mining for Economics and Finance Latent Dirichlet Allocation Text Mining for Economics and Finance Latent Dirichlet Allocation Stephen Hansen Text Mining Lecture 5 1 / 45 Introduction Recall we are interested in mixed-membership modeling, but that the plsi model

More information

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process

19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process 10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable

More information

Dirichlet Enhanced Latent Semantic Analysis

Dirichlet Enhanced Latent Semantic Analysis Dirichlet Enhanced Latent Semantic Analysis Kai Yu Siemens Corporate Technology D-81730 Munich, Germany Kai.Yu@siemens.com Shipeng Yu Institute for Computer Science University of Munich D-80538 Munich,

More information

GEP Annotation Report

GEP Annotation Report GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name:

More information

Lecture 13 : Variational Inference: Mean Field Approximation

Lecture 13 : Variational Inference: Mean Field Approximation 10-708: Probabilistic Graphical Models 10-708, Spring 2017 Lecture 13 : Variational Inference: Mean Field Approximation Lecturer: Willie Neiswanger Scribes: Xupeng Tong, Minxing Liu 1 Problem Setup 1.1

More information

Applying LDA topic model to a corpus of Italian Supreme Court decisions

Applying LDA topic model to a corpus of Italian Supreme Court decisions Applying LDA topic model to a corpus of Italian Supreme Court decisions Paolo Fantini Statistical Service of the Ministry of Justice - Italy CESS Conference - Rome - November 25, 2014 Our goal finding

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

Microbiome: 16S rrna Sequencing 3/30/2018

Microbiome: 16S rrna Sequencing 3/30/2018 Microbiome: 16S rrna Sequencing 3/30/2018 Skills from Previous Lectures Central Dogma of Biology Lecture 3: Genetics and Genomics Lecture 4: Microarrays Lecture 12: ChIP-Seq Phylogenetics Lecture 13: Phylogenetics

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Data Mining Techniques

Data Mining Techniques Data Mining Techniques CS 622 - Section 2 - Spring 27 Pre-final Review Jan-Willem van de Meent Feedback Feedback https://goo.gl/er7eo8 (also posted on Piazza) Also, please fill out your TRACE evaluations!

More information

Bayes methods for categorical data. April 25, 2017

Bayes methods for categorical data. April 25, 2017 Bayes methods for categorical data April 25, 2017 Motivation for joint probability models Increasing interest in high-dimensional data in broad applications Focus may be on prediction, variable selection,

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

METABOLIC PATHWAY PREDICTION/ALIGNMENT

METABOLIC PATHWAY PREDICTION/ALIGNMENT COMPUTATIONAL SYSTEMIC BIOLOGY METABOLIC PATHWAY PREDICTION/ALIGNMENT Hofestaedt R*, Chen M Bioinformatics / Medical Informatics, Technische Fakultaet, Universitaet Bielefeld Postfach 10 01 31, D-33501

More information

A. Incorrect! In the binomial naming convention the Kingdom is not part of the name.

A. Incorrect! In the binomial naming convention the Kingdom is not part of the name. Microbiology Problem Drill 08: Classification of Microorganisms No. 1 of 10 1. In the binomial system of naming which term is always written in lowercase? (A) Kingdom (B) Domain (C) Genus (D) Specific

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank

A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank A Unified Posterior Regularized Topic Model with Maximum Margin for Learning-to-Rank Shoaib Jameel Shoaib Jameel 1, Wai Lam 2, Steven Schockaert 1, and Lidong Bing 3 1 School of Computer Science and Informatics,

More information

Microbes and you ON THE LATEST HUMAN MICROBIOME DISCOVERIES, COMPUTATIONAL QUESTIONS AND SOME SOLUTIONS. Elizabeth Tseng

Microbes and you ON THE LATEST HUMAN MICROBIOME DISCOVERIES, COMPUTATIONAL QUESTIONS AND SOME SOLUTIONS. Elizabeth Tseng Microbes and you ON THE LATEST HUMAN MICROBIOME DISCOVERIES, COMPUTATIONAL QUESTIONS AND SOME SOLUTIONS Elizabeth Tseng Dept. of CSE, University of Washington Johanna Lampe Lab, Fred Hutchinson Cancer

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

Microbial Taxonomy and the Evolution of Diversity

Microbial Taxonomy and the Evolution of Diversity 19 Microbial Taxonomy and the Evolution of Diversity Copyright McGraw-Hill Global Education Holdings, LLC. Permission required for reproduction or display. 1 Taxonomy Introduction to Microbial Taxonomy

More information

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007 Understanding Science Through the Lens of Computation Richard M. Karp Nov. 3, 2007 The Computational Lens Exposes the computational nature of natural processes and provides a language for their description.

More information

Bayesian Nonparametrics: Models Based on the Dirichlet Process

Bayesian Nonparametrics: Models Based on the Dirichlet Process Bayesian Nonparametrics: Models Based on the Dirichlet Process Alessandro Panella Department of Computer Science University of Illinois at Chicago Machine Learning Seminar Series February 18, 2013 Alessandro

More information

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster. NCBI BLAST Services DELTA-BLAST BLAST (http://blast.ncbi.nlm.nih.gov/), Basic Local Alignment Search tool, is a suite of programs for finding similarities between biological sequences. DELTA-BLAST is a

More information

Microbial analysis with STAMP

Microbial analysis with STAMP Microbial analysis with STAMP Conor Meehan cmeehan@itg.be A quick aside on who I am Tangents already! Who I am A postdoc at the Institute of Tropical Medicine in Antwerp, Belgium Mycobacteria evolution

More information

Document and Topic Models: plsa and LDA

Document and Topic Models: plsa and LDA Document and Topic Models: plsa and LDA Andrew Levandoski and Jonathan Lobo CS 3750 Advanced Topics in Machine Learning 2 October 2018 Outline Topic Models plsa LSA Model Fitting via EM phits: link analysis

More information

Supplemental Materials

Supplemental Materials JOURNAL OF MICROBIOLOGY & BIOLOGY EDUCATION, May 2013, p. 107-109 DOI: http://dx.doi.org/10.1128/jmbe.v14i1.496 Supplemental Materials for Engaging Students in a Bioinformatics Activity to Introduce Gene

More information

Microbiota: Its Evolution and Essence. Hsin-Jung Joyce Wu "Microbiota and man: the story about us

Microbiota: Its Evolution and Essence. Hsin-Jung Joyce Wu Microbiota and man: the story about us Microbiota: Its Evolution and Essence Overview q Define microbiota q Learn the tool q Ecological and evolutionary forces in shaping gut microbiota q Gut microbiota versus free-living microbe communities

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Sparse Stochastic Inference for Latent Dirichlet Allocation

Sparse Stochastic Inference for Latent Dirichlet Allocation Sparse Stochastic Inference for Latent Dirichlet Allocation David Mimno 1, Matthew D. Hoffman 2, David M. Blei 1 1 Dept. of Computer Science, Princeton U. 2 Dept. of Statistics, Columbia U. Presentation

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) D. Blei, A. Ng, and M. Jordan. Journal of Machine Learning Research, 3:993-1022, January 2003. Following slides borrowed ant then heavily modified from: Jonathan Huang

More information

Kernel Density Topic Models: Visual Topics Without Visual Words

Kernel Density Topic Models: Visual Topics Without Visual Words Kernel Density Topic Models: Visual Topics Without Visual Words Konstantinos Rematas K.U. Leuven ESAT-iMinds krematas@esat.kuleuven.be Mario Fritz Max Planck Institute for Informatics mfrtiz@mpi-inf.mpg.de

More information

Introduction To Machine Learning

Introduction To Machine Learning Introduction To Machine Learning David Sontag New York University Lecture 21, April 14, 2016 David Sontag (NYU) Introduction To Machine Learning Lecture 21, April 14, 2016 1 / 14 Expectation maximization

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Information retrieval LSI, plsi and LDA. Jian-Yun Nie

Information retrieval LSI, plsi and LDA. Jian-Yun Nie Information retrieval LSI, plsi and LDA Jian-Yun Nie Basics: Eigenvector, Eigenvalue Ref: http://en.wikipedia.org/wiki/eigenvector For a square matrix A: Ax = λx where x is a vector (eigenvector), and

More information

BIOINFORMATICS LAB AP BIOLOGY

BIOINFORMATICS LAB AP BIOLOGY BIOINFORMATICS LAB AP BIOLOGY Bioinformatics is the science of collecting and analyzing complex biological data. Bioinformatics combines computer science, statistics and biology to allow scientists to

More information

Bioinformatics Exercises

Bioinformatics Exercises Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

More information

Genome Annotation Project Presentation

Genome Annotation Project Presentation Halogeometricum borinquense Genome Annotation Project Presentation Loci Hbor_05620 & Hbor_05470 Presented by: Mohammad Reza Najaf Tomaraei Hbor_05620 Basic Information DNA Coordinates: 527,512 528,261

More information

Bayesian Nonparametrics: Dirichlet Process

Bayesian Nonparametrics: Dirichlet Process Bayesian Nonparametrics: Dirichlet Process Yee Whye Teh Gatsby Computational Neuroscience Unit, UCL http://www.gatsby.ucl.ac.uk/~ywteh/teaching/npbayes2012 Dirichlet Process Cornerstone of modern Bayesian

More information

Distributed ML for DOSNs: giving power back to users

Distributed ML for DOSNs: giving power back to users Distributed ML for DOSNs: giving power back to users Amira Soliman KTH isocial Marie Curie Initial Training Networks Part1 Agenda DOSNs and Machine Learning DIVa: Decentralized Identity Validation for

More information

Non-parametric Clustering with Dirichlet Processes

Non-parametric Clustering with Dirichlet Processes Non-parametric Clustering with Dirichlet Processes Timothy Burns SUNY at Buffalo Mar. 31 2009 T. Burns (SUNY at Buffalo) Non-parametric Clustering with Dirichlet Processes Mar. 31 2009 1 / 24 Introduction

More information

Lesson Overview. Gene Regulation and Expression. Lesson Overview Gene Regulation and Expression

Lesson Overview. Gene Regulation and Expression. Lesson Overview Gene Regulation and Expression 13.4 Gene Regulation and Expression THINK ABOUT IT Think of a library filled with how-to books. Would you ever need to use all of those books at the same time? Of course not. Now picture a tiny bacterium

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms

2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms Bioinformatics Advance Access published March 7, 2007 The Author (2007). Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

More information

Inferring Transcriptional Regulatory Networks from Gene Expression Data II

Inferring Transcriptional Regulatory Networks from Gene Expression Data II Inferring Transcriptional Regulatory Networks from Gene Expression Data II Lectures 9 Oct 26, 2011 CSE 527 Computational Biology, Fall 2011 Instructor: Su-In Lee TA: Christopher Miles Monday & Wednesday

More information

Applying hlda to Practical Topic Modeling

Applying hlda to Practical Topic Modeling Joseph Heng lengerfulluse@gmail.com CIST Lab of BUPT March 17, 2013 Outline 1 HLDA Discussion 2 the nested CRP GEM Distribution Dirichlet Distribution Posterior Inference Outline 1 HLDA Discussion 2 the

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008

Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 Lecture 8 Learning Sequence Motif Models Using Expectation Maximization (EM) Colin Dewey February 14, 2008 1 Sequence Motifs what is a sequence motif? a sequence pattern of biological significance typically

More information

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1

Topic Models. Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW. Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1 Topic Models Advanced Machine Learning for NLP Jordan Boyd-Graber OVERVIEW Advanced Machine Learning for NLP Boyd-Graber Topic Models 1 of 1 Low-Dimensional Space for Documents Last time: embedding space

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Networks & pathways. Hedi Peterson MTAT Bioinformatics

Networks & pathways. Hedi Peterson MTAT Bioinformatics Networks & pathways Hedi Peterson (peterson@quretec.com) MTAT.03.239 Bioinformatics 03.11.2010 Networks are graphs Nodes Edges Edges Directed, undirected, weighted Nodes Genes Proteins Metabolites Enzymes

More information

Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy

Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy Understanding Sequence, Structure and Function Relationships and the Resulting Redundancy many slides by Philip E. Bourne Department of Pharmacology, UCSD Agenda Understand the relationship between sequence,

More information

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype Lecture Series 7 From DNA to Protein: Genotype to Phenotype Reading Assignments Read Chapter 7 From DNA to Protein A. Genes and the Synthesis of Polypeptides Genes are made up of DNA and are expressed

More information

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes

Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Sharing Clusters Among Related Groups: Hierarchical Dirichlet Processes Yee Whye Teh (1), Michael I. Jordan (1,2), Matthew J. Beal (3) and David M. Blei (1) (1) Computer Science Div., (2) Dept. of Statistics

More information

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text

Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Learning the Semantic Correlation: An Alternative Way to Gain from Unlabeled Text Yi Zhang Machine Learning Department Carnegie Mellon University yizhang1@cs.cmu.edu Jeff Schneider The Robotics Institute

More information

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON PROKARYOTE GENES: E. COLI LAC OPERON CHAPTER 13 CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON Figure 1. Electron micrograph of growing E. coli. Some show the constriction at the location where daughter

More information

Lecture 3a: Dirichlet processes

Lecture 3a: Dirichlet processes Lecture 3a: Dirichlet processes Cédric Archambeau Centre for Computational Statistics and Machine Learning Department of Computer Science University College London c.archambeau@cs.ucl.ac.uk Advanced Topics

More information

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign

Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING. Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign Chapter 8 PROBABILISTIC MODELS FOR TEXT MINING Yizhou Sun Department of Computer Science University of Illinois at Urbana-Champaign sun22@illinois.edu Hongbo Deng Department of Computer Science University

More information

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations

Pachinko Allocation: DAG-Structured Mixture Models of Topic Correlations : DAG-Structured Mixture Models of Topic Correlations Wei Li and Andrew McCallum University of Massachusetts, Dept. of Computer Science {weili,mccallum}@cs.umass.edu Abstract Latent Dirichlet allocation

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Latent Dirichlet Allocation (LDA)

Latent Dirichlet Allocation (LDA) Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language

More information

Image segmentation combining Markov Random Fields and Dirichlet Processes

Image segmentation combining Markov Random Fields and Dirichlet Processes Image segmentation combining Markov Random Fields and Dirichlet Processes Jessica SODJO IMS, Groupe Signal Image, Talence Encadrants : A. Giremus, J.-F. Giovannelli, F. Caron, N. Dobigeon Jessica SODJO

More information

BME 5742 Biosystems Modeling and Control

BME 5742 Biosystems Modeling and Control BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various

More information

AP Bio Module 16: Bacterial Genetics and Operons, Student Learning Guide

AP Bio Module 16: Bacterial Genetics and Operons, Student Learning Guide Name: Period: Date: AP Bio Module 6: Bacterial Genetics and Operons, Student Learning Guide Getting started. Work in pairs (share a computer). Make sure that you log in for the first quiz so that you get

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information