Computation-Based Discovery of Cis-Regulatory Modules by Hidden Markov Model


Jing Wu and Jun Xie
Department of Statistics, Purdue University
150 N. University Street, West Lafayette, IN

June 13, 2006

Running head: Discovery of CRM by HMM

Key Words: Bayesian inference; Cis-regulatory module; Hidden Markov model; Transcription factor binding site.

Abstract

A key component in genome sequence analysis is the identification of regions of the genome that contain regulatory information. In higher eukaryotes, this information is organized into modular units called cis-regulatory modules. Each module contains multiple binding sites for a specific combination of several transcription factors. In this article, we propose a hidden Markov model (HMM) to identify transcription factor binding sites (TFBSs) and cis-regulatory modules (CRMs). For a given genomic sequence, we first select potential TFBSs from a large database (e.g., TRANSFAC), then construct an HMM in which TFBSs are only counted when they occur within a specialized CRM state. A novel feature of the proposed method is that it does not assume a small set of TFBSs for a given gene; instead, it utilizes information from a large collection of well-characterized TFBSs and is therefore computationally more efficient and robust than de novo methods. Our approach is applied to three data sets with experimentally evaluated TFBSs. The method shows better specificity and sensitivity than other similar computational tools in identifying CRMs and TFBSs. The executable code of our programs and module predictions across the fly Drosophila genome are available at jingwu/module/.

1 Introduction

Gene expression is regulated through the interaction of transcription factors (TFs) with specific cis-regulatory DNA sequences. Recent studies suggest a modular organization of binding sites for transcription factors (Yuh et al. 1998). More specifically, the expression level of a gene is determined by the cooperation of several TFs, whose sites are organized in a modular unit called a cis-regulatory module (CRM). A major challenge in analyzing genomic

sequences is characterizing functional combinations of TF binding sites. The identification of clusters of TF binding sites will aid understanding of the rules governing gene regulation.

A TF binding site motif is commonly modeled by a position weight matrix (PWM). Based on an alignment of known sites, a PWM is constructed whose columns define the probabilities of observing each nucleotide at each position. Using PWMs to scan genomic sequences, several computational and statistical methods have been developed for identifying pairs of TFs that tend to co-occur in close proximity in sequences (Elkon et al. 2003). In addition, approaches for predicting clusters of TFBSs have been proposed, including the logistic regression method of Wasserman and Fickett (1998) and a program called CREME (Sharan et al. 2003). These programs consider motifs as a cluster if they lie within a window of a certain length. The programs search for PWMs one by one, and therefore do not attempt to model the modular structure of TFBSs. More recently, hidden Markov models (HMMs) have been employed to detect CRMs (Crowley et al. 1997; Frith et al. 2001; Bailey and Noble 2003; Rajewsky et al. 2002; Sinha et al. 2003). The HMM, which originated in speech recognition, provides good stochastic machinery for producing the observed sequences from an underlying model with clusters of binding site motifs. While numerous algorithms already exist, they have several limitations. First, most algorithms focus on a small number of known TFBSs whose PWMs need to be clearly defined before the search. Second, all the algorithms require the user to specify a module window size for the search. Formally, a genomic sequence is parsed into a series of overlapping windows of length l, whose starting positions differ by a specified step size δ. Windows with high scores are predicted as CRMs. Since CRMs are known to vary greatly in size, these HMM algorithms could be improved by allowing for variable window lengths.
We also want to eliminate the extra parameter δ, which could affect predictions.
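The PWM scanning that underlies these window-based methods, and that is formalized later in the Method section, can be sketched in a few lines. The following is a minimal illustration; the toy PWM, the iid background model, and the number of random sequences are illustrative assumptions, and only the forward strand is scanned (the Method section scans both strands and uses a first-order Markov background):

```python
import math
import random

def max_log_odds(seq, pwm, bg):
    """Slide a PWM over seq and return the maximum log-odds score S.

    pwm: list of dicts, one per motif position, mapping base -> probability.
    bg:  dict mapping base -> background probability (iid background here,
         a simplification of the paper's first-order Markov background).
    """
    w = len(pwm)
    best = float("-inf")
    for i in range(len(seq) - w + 1):
        s = sum(math.log(pwm[j][seq[i + j]] / bg[seq[i + j]])
                for j in range(w))
        best = max(best, s)
    return best

def empirical_pvalue(seq, pwm, bg, B=200):
    """Monte Carlo p-value: fraction of B random sequences (drawn from the
    background) whose maximum score strictly exceeds the observed one."""
    s0 = max_log_odds(seq, pwm, bg)
    bases, weights = zip(*bg.items())
    exceed = 0
    for _ in range(B):
        rand = "".join(random.choices(bases, weights=weights, k=len(seq)))
        if max_log_odds(rand, pwm, bg) > s0:
            exceed += 1
    return exceed / B
```

PWMs whose empirical p-value falls below a threshold (0.05 in the Method section) would then be retained for the given sequence.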

There is another set of de novo approaches, including the Gibbs motif sampler (Lawrence et al. 1993), AlignACE (Roth et al. 1998), BioProspector (Liu et al. 2001), and CisModule (Zhou and Wong 2004). These programs do not utilize any known transcription factor PWMs but try to discover conserved sequences de novo. Though novel motifs may be found, these programs depend on a good collection of multiple genomic sequences that are assumed to contain a common TF binding site motif. Besides, the identified sequence motifs may not match any known TFBS and therefore do not provide direct binding information. In this sense, the de novo approaches could be complemented by utilizing available TFBS patterns. Standard Gibbs sampling approaches were developed for aligning single TF motifs (Lawrence et al. 1993; Liu et al. 2001). Zhou and Wong (2004) extended the method to a Bayesian hierarchical model of cis-regulatory modules. Their program CisModule can provide simultaneous inference of modules and TFBSs for appropriate sequence sets. Nevertheless, without using prior information from known PWMs, CisModule estimates a large set of parameters and therefore often encounters convergence problems. Moreover, CisModule searches for TF motifs within a window of given module size, e.g., 100 or 150 base pairs. Determining the module length best suited to a data set can be difficult. CisModule offers an alternative of using a Metropolis update to determine the module length; however, the length is then fixed for all modules once it is decided. In this paper we develop a method that combines the good features of HMMs with the flexibility of Bayesian inference for identifying CRMs and individual TFBSs. We attempt to improve both the existing HMM and the de novo methods by an intermediate approach. First, we select potential TFBSs for each genomic sequence from a large pool of available PWMs, for instance, the TRANSFAC database (Wingender et al. 2000).
TRANSFAC is the most complete database of carefully evaluated binding sites, in which over 400 experimentally derived PWMs are collected. Using all PWMs in TRANSFAC, we systematically scan each genomic sequence in the data set. The statistically significant PWMs are then used to construct an HMM architecture. Another feature of our HMM is a hidden state for modules, defined in addition to the states of TFBS motifs. A module consists of several TFBSs and the background sequence connecting them. By defining the module state, only TFBSs that occur within the module state are counted. In addition, the module locations and lengths are determined automatically: each module has its own length, most appropriate for the genomic sequence. Furthermore, our HMM is estimated by Bayesian inference through sampling. The emission probabilities of the HMM are assumed known from the selected PWMs. The remaining unknown parameters are a small set of transition probabilities, which are estimated by maximizing the posterior distribution conditional on the observed genomic sequences. We detect CRMs and TFBSs that have high posterior probabilities using simulated samples.

A cis-regulatory module is typically located in the genomic sequence near the genes it regulates. Therefore, we focus the applications on promoter sequences, which span several hundred bp near the transcription start sites (TSSs), for instance, 700 bp upstream and 200 bp downstream of the TSS. On the other hand, the proposed approach also works well on long genomic sequences. One of our examples was to predict CRMs in the whole genome of the fly Drosophila. We searched for embryonic developmental cis-regulatory elements in genomic sequences 24 kb long. The extra module state defined in our HMM automatically predicts CRMs as islands in the long sequences. Our HMM structure shows an advantage over the existing HMMs, which consist of only TFBS states. Our HMM approach was applied in two additional examples, including promoter sequences of E2F target genes and promoter sequences of muscle-specific regulatory regions.
The applications showed that our method

improved predictions of motifs and CRMs and was computationally efficient.

2 Method

2.1 Selecting candidate PWMs

The structure of an HMM depends on a set of specified PWMs that correspond to all possible TFBSs for a given sequence. In our approach, the possible TFBSs are systematically selected from a database of PWMs (e.g., TRANSFAC). Specifically, for a given sequence, we choose the PWMs that have significant scores compared with the scores obtained by random chance. For a single sequence X = {x_1, ..., x_L}, we consider both the forward strand and its reverse complement when selecting PWMs. Denote the reverse complement of X as {x_{L+1}, ..., x_{2L}}. Let P be a given PWM with width w. At position i in the extended sequence, the log-odds score of the PWM is defined as

s(i) = log [ P(x_i ... x_{i+w-1}) / Q(x_i ... x_{i+w-1}) ],

where Q is a probability matrix for the background. Then we define the score S as the maximum value of the log-odds score of the PWM:

S = max_{i=1,...,2(L-w+1)} s(i).  (1)

Under the null hypothesis that there is no instance of the transcription factor binding site P, the null distribution of the score S of the PWM can be estimated by Monte Carlo simulation from a set of randomly selected sequences of length L. More specifically, for the given PWM P, we first calculate the maximum score S_0 in the observed sequence X by Equation (1). Then, for a set of randomly selected sequences, say B = 200 random sequences,

each with the same length as the observed sequence, the maximum scores of the PWM are calculated as S_b, b = 1, ..., B. The p-value of the score S_0 is estimated by

p = Σ_b I(S_b > S_0) / B,

where I(·) is the indicator function. The PWMs from a database (e.g., TRANSFAC) with p-values less than 0.05 are selected for each genomic sequence. This typically results in roughly K = 20 PWMs per sequence.

2.2 Hidden Markov model

A genomic sequence can be considered as an observation from an HMM with the hidden path indicating the module and within-module TFBSs. We introduce the hidden states, the transition probabilities, the emission probabilities, and the likelihood in the following. First, the module and the outside background are defined by two hidden states in the model, i.e., the two circles in Figure 1. Suppose there are at most K possible different TFs in a given sequence. Then K additional hidden states are used to indicate the TFBSs. As a simple example, Figure 1 illustrates an HMM with K = 3 possible TFBS states, denoted by PWMs, with states K + 1 and 0 indicating the background states inside and outside of modules respectively. The hidden path of a module consists of several TFBSs from the set of K possible PWMs, connected by background sequences of state K + 1. The modules are in turn connected by sequences of state 0 to obtain the hidden path of the whole genomic sequence. The transition probabilities between the hidden states are treated as unknown parameters. The notation is specified as follows. At each position of the observed sequence, given that the previous state is the background (state 0), there is a probability r of initiating a module and a probability 1 − r of staying in the background. If a module starts at a position, then its hidden state becomes K + 1, i.e., the background inside the module, and a series of

Figure 1: A simplified hidden Markov model with K = 3 possible TFBSs. State 0: background outside of modules; State K + 1: background inside a module. Transition probabilities are indicated at the arrows.

transitions will happen within the module. From state K + 1, there is a probability q_k of initiating the kth motif site, and consequently the following w_k positions will have the hidden state PWM_k, where w_k is the width of the kth motif. In addition, there is a probability q_{K+1} of staying in the module background state K + 1 and a probability q_0 of terminating the module, changing from the module to the outside background state 0. The transition probabilities satisfy Σ_{k=0}^{K+1} q_k = 1. Figure 1 shows the structure of the hidden Markov chain with the transition probabilities indicated at the arrows.

The emission probabilities are assumed known. Each PWM_k inside a module is selected from a database (e.g., TRANSFAC), so its position-specific probabilities Θ_k and motif width w_k are given. For the regions outside the modules and the background regions within the modules, i.e., states 0 and K + 1 respectively, the emission probability comes from a first-order Markov chain with parameter θ_0. This parameter can easily be estimated from the observed sequence data.

Under the HMM, the complete sequence likelihood is defined. For notational simplicity, we only consider a single observed sequence. For a group of sequences, we assume the sequences are independent; therefore the joint probability becomes the product of the probabilities for each

sequence, and the estimation procedure is carried out one sequence at a time. Let X denote a single sequence. Let M be the module indicators and A = {A_1, ..., A_K, A_{K+1}} the TFBS indicators, where A_k, k = 1, ..., K, is the indicator of the binding sites for the kth motif and A_{K+1} is the indicator of the non-site background sequences inside the modules. We use X(M^c) to denote the background sequence outside of the modules. Denote Θ = {θ_0, Θ_1, ..., Θ_K}, where θ_0 is the background parameter of the first-order Markov model and Θ_k is the kth PWM. Then, given Θ, q = {q_0, q_1, ..., q_K, q_{K+1}} and r, the complete sequence likelihood with M and A given is

P(X, M, A | Θ, q, r) = P(M, A | r, q) P(X(M^c) | θ_0, M) P(X(M) | M, A, Θ).  (2)

Note that the parameters q and r are unknown but the emission probability parameters Θ are given.

2.3 Bayesian inference

The HMM with its unknown parameters r and q can be estimated by a standard Baum-Welch algorithm (Baum 1972). We, however, take an alternative approach within a Bayesian framework. Combining Equation (2) with prior distributions for the parameters q and r gives rise to the joint posterior distribution:

P(M, A, q, r | X, Θ) ∝ P(X, M, A | Θ, q, r) π(q) π(r),  (3)

where π(q) is the conjugate prior distribution for q, a Dirichlet distribution with parameter α = {α_0, ..., α_K, α_{K+1}}, and π(r) is Beta(a, b). By simulating from the posterior distribution, we obtain the estimated parameters q̂, r̂ and consequently the hidden path M̂, Â. The most probable hidden path provides the optimal locations of the modules and the binding sites.
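Because the Dirichlet and Beta priors are conjugate to the transition counts, the transition parameters can be redrawn in closed form once a hidden path has been sampled. A hedged sketch of this conditional draw (the function and argument names are illustrative, not taken from the paper's code):

```python
import random

def sample_q_r(n_modules, site_counts, bg_inside_len, total_module_len,
               seq_len, alpha, a, b):
    """Draw (q, r) from their conditional posteriors given a hidden path.

    site_counts: number of sampled sites for each of the K motifs.
    alpha: Dirichlet prior parameters (alpha_0, ..., alpha_{K+1}).
    """
    # Posterior Dirichlet counts for q = (q_0, q_1, ..., q_K, q_{K+1}):
    # module terminations, motif initiations, within-module background steps.
    counts = [n_modules] + list(site_counts) + [bg_inside_len]
    # A Dirichlet draw via normalized independent Gamma variates.
    draws = [random.gammavariate(c + al, 1.0) for c, al in zip(counts, alpha)]
    q = [d / sum(draws) for d in draws]
    # Beta posterior for the module-initiation probability r.
    r = random.betavariate(n_modules + a, seq_len - total_module_len + b)
    return q, r
```

The exact posterior count definitions are those given for step 2 of the sampler in the next subsection.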

The Bayesian inferences of the unknown parameters and the hidden path are derived by iterative sampling. First, the initial values of q and r are generated from the prior distributions, the Dirichlet π(q) and the Beta π(r) respectively. Then the following two steps continuously update the parameters.

1. Sampling M and A given q and r. To simulate the hidden path of M and A, the forward algorithm of HMM approaches (Durbin et al. 1998) is first used to calculate the marginal probability of X, summing over all possible hidden paths. Consider a sequence X = {x_1, ..., x_L}. For a partial sequence {x_1, ..., x_m}, m ≤ L, let h_m(k) be the probability of the observed partial sequence, requiring that the last residue x_m is in state k,

h_m(k) = P({x_1, ..., x_m}, state of x_m = k | Θ, q, r), k = 0, 1, ..., K, K + 1.

If k is a motif, then the last residue is at the last position of the motif. Note that P(X | Θ, q, r) = Σ_{k=0}^{K+1} h_L(k). With the HMM shown in Figure 1, the recursion formulae are given by

h_m(0) = (1 − r) P(x_m | x_{m−1}, θ_0) h_{m−1}(0) + q_0 P(x_m | x_{m−1}, θ_0) h_{m−1}(K + 1),

h_m(K + 1) = P(x_m | x_{m−1}, θ_0) [ r h_{m−1}(0) + Σ_{k=1}^{K} h_{m−1}(k) + q_{K+1} h_{m−1}(K + 1) ],

h_m(k) = q_k P({x_{m−w_k+1}, ..., x_m} | Θ_k) h_{m−w_k}(K + 1), k = 1, ..., K,  (4)

where P({x_{m−w_k+1}, ..., x_m} | Θ_k) is the probability of generating the segment from the kth motif model PWM_k. The initial conditions of Formula (4) are h_0(0) = 1 and h_m(k) = 0 for k = 1, ..., K + 1 and m ≤ 0.
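A direct transcription of recursion (4) makes the bookkeeping concrete. The sketch below is a simplified illustration, not the paper's implementation: it uses an iid background emission in place of the first-order Markov background and keeps probabilities unscaled, so it is only suitable for short sequences.

```python
import math

def forward(seq, pwms, q, r, bg):
    """Forward recursion: h[m][k] = probability of x_1..x_m with x_m ending
    in state k. States: 0 = outside background, 1..K = motifs,
    K + 1 = inside-module background.

    pwms: list of K motifs, each a list of dicts (base -> probability).
    q: transition probabilities (q_0, q_1, ..., q_K, q_{K+1}) from the
       inside-module background state; r: module-initiation probability.
    """
    K = len(pwms)
    L = len(seq)
    widths = [len(p) for p in pwms]
    # h[m] maps state -> probability; positions are 1-based, h[0] initial.
    h = [{k: 0.0 for k in range(K + 2)} for _ in range(L + 1)]
    h[0][0] = 1.0
    for m in range(1, L + 1):
        e = bg[seq[m - 1]]  # iid background emission of x_m (simplification)
        h[m][0] = e * ((1 - r) * h[m - 1][0] + q[0] * h[m - 1][K + 1])
        h[m][K + 1] = e * (r * h[m - 1][0]
                           + sum(h[m - 1][k] for k in range(1, K + 1))
                           + q[K + 1] * h[m - 1][K + 1])
        for k in range(1, K + 1):
            w = widths[k - 1]
            if m >= w:  # initial conditions: h_m(k) = 0 for m - w_k < 0
                seg = seq[m - w:m]
                p_seg = math.prod(pwms[k - 1][j][seg[j]] for j in range(w))
                h[m][k] = q[k] * p_seg * h[m - w][K + 1]
    return h  # the total likelihood P(X) is sum(h[L].values())
```

In practice the recursion would be run in log space or with per-position scaling to avoid underflow on genomic-length sequences.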

With all the values h_m(k) calculated, backward sampling is used to sample M and A as follows. Starting from m = L, at position m we generate the hidden state of x_m as either the background outside of modules, i.e., state 0, or the last position of a module, i.e., one of the states k = 1, ..., K, K + 1, with probabilities proportional to the corresponding h_m(0), h_m(k) and h_m(K + 1) in Formula (4). Once we have generated the last position of a module, and supposing we are backward at position m in the current module, the sequence segment {x_{m−w_k+1}, ..., x_m}, k = 1, ..., K, K + 1, is drawn as a background letter (w_{K+1} = 1) or a site for one of the K motifs with probability proportional to h_m(k) and h_m(K + 1) in Formula (4). Depending on the generated state, we move to the preceding position and repeat the sampling procedure until we reach the first position of the sequence.

2. Sampling q and r given M and A. Given the current sample M and A, denote the total number of modules by |M|, the length of each module by l_j, j = 1, ..., |M|, and the number of sites for the kth motif by |A_k|; then |A_{K+1}| = Σ_{j=1}^{|M|} l_j − Σ_{k=1}^{K} |A_k| w_k is the total length of the non-site background segments within the modules. Then [q | M, A] follows Dirichlet(|M| + α_0, |A_1| + α_1, ..., |A_K| + α_K, |A_{K+1}| + α_{K+1}). Similarly, [r | M] = Beta(|M| + a, L − Σ_{j=1}^{|M|} l_j + b), where L is the total length of X.

We infer the modules from the marginal probability of M, i.e., the frequency with which each sequence position is sampled as within modules. Formally, the sampling procedures are repeated 1000 times. The positions where the frequency of being in a module is > 0.5 are predicted as modules. Similarly, the positions where the frequency of being PWM_k, k = 1, ..., K, is greater than a cutoff are predicted as TFBSs. Overlapping motifs are resolved by selecting the one

that has the larger frequency.

3 Results

We compare our HMM with several algorithms, including the de novo methods, the other program using the TRANSFAC database (CREME by Sharan et al. 2003), and the existing HMM approaches. To clarify the notation, we call our HMM Module HMM.

3.1 E2F Binding Sites

The E2F family of transcription factors plays a crucial and well-established role in cell cycle progression. Experimental identification of E2F-regulated genes is time consuming and expensive. In contrast, computational approaches such as our HMM method can provide useful guidelines to detect E2F binding sites and hence identify new target genes bound by E2F. A set of experimentally proven E2F binding sites has been reported by Kel et al. (see Table 1 in Kel et al. 2001). Using the gene names provided there, we collected 17 human promoter sequences from the genome database at UC Santa Cruz (genome.ucsc.edu). Each promoter sequence is 900 bp long, with 700 bp upstream and 200 bp downstream of the transcription start site. These 17 human promoter sequences contain 21 experimentally verified E2F sites according to Kel et al. (2001). The Module HMM method was applied to each of the 17 sequences. First we decided the set of possible TFBSs for each sequence. All the PWMs in TRANSFAC were used to scan a sequence. The most significant PWMs, with p-values less than 0.10, were selected, which gives 20 to 30 PWMs for constructing an HMM. Based on experience, the prior parameters of

the Dirichlet distribution of q were set as α_0 = α_1 = ... = α_K = 1 and α_{K+1} = 100. The prior parameters of the Beta distribution for r were a = 100 and b = 800. The sampling procedure was applied 1000 times to each sequence. The regions that were sampled as modules more than 50% of the time were predicted as modules. Within modules, the most frequently occurring PWMs (out of 1000 simulations) were reported as the TFBSs.

We evaluated Module HMM by comparing it with other programs. The existing HMM algorithms (Frith et al. 2001; the Ahab algorithm in Rajewsky et al. 2002; Sinha et al. 2003) require a small set of known TFBSs contained in the sequences, which is not an assumption in this example. We compared our HMM approach with CREME (Sharan et al. 2003), BioProspector (BP) (Liu et al. 2001), and CisModule (Zhou and Wong 2004). Table 1 shows the true positives (sensitivity) and false positives (specificity) of Module HMM against each of the three programs. The true positives refer to the number of correctly predicted E2F sites (out of 21 true E2F sites). The false positives were obtained by applying the programs to data sets of randomly generated sequences. The random sequences were generated by a first-order Markov chain whose probability matrix was estimated from another set of randomly selected human promoter sequences in the genome database of UCSC (genome.ucsc.edu). Each random sequence has the same length of 900 bp. In Table 1, the first comparison shows that Module HMM identified 12 true E2F sites out of 21 in the known E2F binding sequences while predicting only 5 E2F sites in 100 random sequences. In contrast, CREME identified no sites at a significance level of 5%. In the second comparison, of Module HMM with BioProspector, we did not assume that a CRM must contain E2F sites but considered CRMs with any type of TFBS motif.
Module HMM achieved a 57% true positive rate in the known E2F binding sequences while predicting 612 TFBSs in the 100 random sequences, whereas BioProspector predicted 623 TFBSs in the 100 random

sequences with only 3 true positives. The third comparison, of Module HMM with CisModule, was on a combined data set of 37 sequences, where 17 were from the known E2F sequences and 20 were random sequences. Again, Module HMM had a false positive rate similar to CisModule's in the random sequences (15 vs. 13 predictions of any motif in 20 random sequences) but higher true positives in the known E2F sequences (7 vs. 0).

             True E2F sites predicted   Prediction rate of E2F
             in the E2F set             in 100 random sequences
Module HMM   12/21 (57%)                5
CREME        0                          < 5%

             True E2F sites predicted   Prediction rate of any motif
             in the E2F set             in 100 random sequences
Module HMM   12/21 (57%)                612
BP           3/21 (14%)                 623

             True E2F sites predicted   Prediction rate of any motif
             in the E2F set             in 20 random sequences
Module HMM   7/21 (33%)                 15
CisModule    0                          13

Table 1: Comparison of Module HMM, CREME, BioProspector, and CisModule. The first column shows true positives in the 17 known E2F binding sequences. The last column gives false positives in random sequences. The value 5% is the significance level of CREME.

Besides identifying the E2F sites in the data set, our HMM provides cis-regulatory module predictions. One of the predictions was confirmed by a well-studied promoter sequence of human CDC2, whose proximal TFBSs have been annotated in the upstream region of about

bp (Tommasi and Pfeifer 1995). There are a total of 12 annotated TFBSs (see Figure 6 in Tommasi and Pfeifer 1995), including an E2F binding site. Our HMM approach identified 10 sites, clustered into 3 modules. The result is shown in Figure 2. In the remaining regions of the sequence, more binding sites were predicted (not shown in Figure 2), which would be putative TFBSs.

Figure 2: The HMM method correctly identified TFBSs in the human CDC2 promoter (promoter sequence not reproduced here; the labeled sites include Sp1, E2F, ets-2, and CCAAT elements). The modules are marked in square brackets and the TFBSs are in ovals. The arrow marks the transcription initiation site. Compared to Figure 6 in (Tommasi and Pfeifer 1995), ten out of twelve TFBSs were identified by the HMM method.

3.2 Muscle-Specific Regulatory Modules

Extensive studies of genes expressed in skeletal muscle have identified specific transcription factors which bind to regulatory elements to control gene expression.
The transcription factors that contribute to skeletal muscle-specific expression include Myf, Mef-2, TEF, SRF,

and Sp-1 (Wasserman and Fickett 1998). Analysis of experimentally determined muscle regulatory sequences indicates that the binding sites of these TFs occur in close proximity as regulatory modules. Zhou and Wong (2004) collected a data set with 29 known regulatory sequences for skeletal muscle-specific expression and 10 random upstream sequences from the ENSEMBL database, each 200 bp long. We applied the Module HMM method to the same data set and compared it with CisModule. Out of the 29 known regulatory sequences for muscle-specific gene expression, 22 are annotated in a database of muscle-specific regulation of transcription. The database provides a summary of the published experimental evidence about muscle-specific transcriptional regulation, including the TFBSs. We used the information there to verify the findings of the Module HMM method. Table 2 shows a comparison of Module HMM with CisModule. The number of true positives for each TFBS is summarized in the format n_1/n_2, where n_2 is the number of experimentally annotated (true) TFBSs and n_1 is the number of true sites predicted by a program. The false positive numbers were obtained when the five muscle-specific TFBSs were predicted in the 10 random sequences in the data set. A big improvement of the Module HMM method over CisModule is that we found all Myf binding sites, 16 out of 16 annotated ones, whereas CisModule (as well as BioProspector and MEME) failed to identify any. Corresponding to this high true positive rate, Module HMM predicted 5 (false) Myf sites in the random sequences. For the other TFs, the Module HMM results were better than the results of CisModule reported in Zhou and Wong (2004), with higher true positive and lower false positive rates, except for TEF, where Module HMM had higher false positives (5 vs. 0). Moreover, the Module HMM method produced similar results in repeated simulations.

In contrast, CisModule gave different results in different simulations.

Algorithm        Myf      Mef-2    TEF      SRF
True Positive
  Module HMM     16/16    10/10    9/10     13/13
  CisModule      0        9/10     7/10     6/13
False Positive
  Module HMM
  CisModule

Table 2: Comparison of our HMM method with CisModule on identifying muscle-specific regulatory regions. The results of CisModule are cited from (Zhou and Wong 2004).

3.3 Regulatory sequences of Drosophila melanogaster

As a comparison with the existing HMM algorithms (Rajewsky et al. 2002; Sinha et al. 2003), we tested our HMM method on the same gene sequences of the fly Drosophila melanogaster as in Rajewsky et al. (2002). We collected upstream sequences of 14 Drosophila developmental genes, each 12 kb long. Our goal is to identify regulatory elements in these genomic sequences. From the literature, there are eight well-studied transcription factors, i.e., Bicoid (Bcd), Caudal (Cad), Dorsal (Dl), Hunchback (Hb), Kruppel (Kr), Knirps (Kni), Tailless (Tll) and the torso response element (torre), acting at very early stages of Drosophila embryo development (Niessing et al. 1997). Note that the HMM algorithm of Sinha et al. (2003) and the first algorithm (Ahab) in Rajewsky et al. (2002) require known TFBSs. To make a direct comparison, we first built our HMM using only the 8 PWMs provided in Rajewsky et al. (2002). Figure 3 illustrates the cis-regulatory module predictions for four genes: even-skipped, giant, hairy,

and hunchback. The estimated probability of being sampled as within modules is plotted as a function of the position in the sequence. We chose a probability of 0.6 as the cutoff to predict a CRM. The known module locations from the literature (Table 1 in Sinha et al. 2003 and the collection at dap5/data 04/appendix2.htm#crms) are marked by the shaded areas in Figure 3. Our Module HMM identified 25 out of the 27 true modules in total, whereas the HMM algorithms of Rajewsky et al. (2002) and Sinha et al. (2003) identified 15 and 16 respectively. Two additional modules were predicted by Module HMM with probability larger than 0.6. One was in the gene knirps, which was the same prediction as in Sinha et al. (2003). The other was in the gene even-skipped, different from Sinha et al.'s extra prediction in the gene runt. These new predictions may be putative modules to be investigated further. It is more interesting to search for CRMs without assuming known information about the eight Drosophila transcription factors. The HMM algorithm of Sinha et al. (2003) requires known PWMs and therefore is not applicable in this case. However, the algorithm Argos (Rajewsky et al. 2002) can predict CRMs from raw genomic sequences. To compare Module HMM with Argos, we started by scanning each 12 kb genomic sequence using all 43 available Drosophila PWMs in TRANSFAC and then predicted CRMs through the forward algorithm and backward sampling. The results were comparable with those of Argos, which has a 50% true positive rate. For seven well-studied genes (even-skipped, giant, hairy, hunchback, knirps, Kruppel, and tailless), with 15 known modules, Argos identified 7 whereas we identified 8 when looking over the 12 kb upstream sequences. The false positive rate was assessed by applying Module HMM to sequences randomly selected from the fly genome.
In a random set of ten 24 kb genomic sequences (12 kb upstream and 12 kb downstream), Module HMM predicted 0.8 modules per 5 kb, which is lower than the false positive rate of Argos, one module per 5 kb.
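The module-calling step used throughout these analyses, thresholding the per-position sampling frequency (0.5 in the E2F analysis, 0.6 here), reduces to extracting the maximal runs of positions above the cutoff. A small sketch (the function name is illustrative):

```python
def call_modules(freq, cutoff=0.6):
    """Turn per-position module frequencies (fraction of sampling
    iterations in which a position was inside a module) into a list of
    predicted module intervals [start, end), 0-based and half-open."""
    modules = []
    start = None
    for i, f in enumerate(freq):
        if f > cutoff and start is None:
            start = i                       # a run above the cutoff begins
        elif f <= cutoff and start is not None:
            modules.append((start, i))      # the run ends before position i
            start = None
    if start is not None:                   # run extends to the sequence end
        modules.append((start, len(freq)))
    return modules
```

For example, `call_modules([0.1, 0.7, 0.8, 0.2, 0.9])` returns `[(1, 3), (4, 5)]`: two predicted modules, appearing as islands in the frequency profile.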

Figure 3: Module prediction in four Drosophila genes. The probability of being sampled as within modules (vertical axis: frequency) is plotted as a function of the position in the sequences. The four genes are even-skipped, giant, hairy, and hunchback respectively. The true module locations are indicated by the vertical lines.

Careful investigation of the 43 TRANSFAC PWMs shows that this collection contains only 5 (Bcd, Dl, Hb, Kr, Tll) of the 8 known transcription factors. When we extended the current TRANSFAC collection of Drosophila transcription factors to include all 8 TFBSs, Module HMM identified 16 out of 27 true CRMs, a 60% true positive rate, compared with the 50% true positive rate of Argos (Rajewsky et al. 2002) in the 14 genomic sequences. For the seven well-studied genes, Module HMM with the extended collection of PWMs had a higher true positive rate and a lower false positive rate: it predicted 10 out of 15 true modules, with a false prediction rate in random sequences of 0.81 modules per 5 kb. Table 3 summarizes the comparisons between our HMM approach and the HMM algorithms of Rajewsky et al. (2002) and Sinha et al. (2003). The new features that we introduced in Module HMM increase the sensitivity of CRM prediction. In particular, a large collection of PWMs provides better predictions in Module HMM.

The next application examined Module HMM in identifying certain CRMs in a whole-genome-scale search. We focused on CRMs that consist of TFBSs from the 8 transcription factors active in the early Drosophila embryo. Our goal was to predict developmental genes with this type of CRM present near the transcription start sites (TSS). The fly genome sequences were downloaded from the UCSC genome database (D. melanogaster genome, April 2004, dm2). There are in total 4368 non-redundant genes in our collection, each with 12 kb upstream and 12 kb downstream sequences from the TSS. Module HMM with the 8 PWMs was run on each 24 kb sequence. We found 479 CRMs located in 327 genes. This indicates a low false positive rate of Module HMM. The predictions are available at jingwu/module/fly/flygenomepredict.txt. This list provides potential Drosophila developmental genes and their regulatory regions.

True positive in 14 genes   Module HMM     Sinha         Rajewsky
  8 known PWMs              25/27 (93%)    16/27 (59%)   15/27 (56%)
  PWMs unknown              16/27 (60%)    n/a           50%

In 7 genes                  Module HMM (TRANSFAC)   Module HMM (TRANSFAC + 8 PWMs)   Rajewsky
  True positive             8/15 (53%)              10/15 (67%)                      7/15 (47%)
  False positive            0.8 module/5 kb         0.81 module/5 kb                 1 module/5 kb

Table 3: Comparison of the Module HMM method with the HMMs of Rajewsky et al. (2002) and Sinha et al. (2003). The true positive numbers are the numbers of predicted modules over the total numbers of known modules. The false positives are predictions of any module type in random sequences. PWMs are selected from TRANSFAC plus the 8 known TFBSs.

4 Discussion

The HMM developed here models a CRM state in addition to individual TFBSs. To construct an HMM, we first select potential TFBSs for a given sequence from all available PWMs (PWM databases plus the literature), then fit the HMM using the stochastic sampling procedure. These new features improve the performance of existing HMM algorithms. Compared with algorithms that do not assume a given set of TFBSs, i.e., Argos (Rajewsky et al. 2002) and CisModule (Zhou and Wong 2004), our method utilizes a large database of PWMs to decide potential TFBSs. The available PWMs provide useful information and therefore improve searches in general. More importantly, the annotated TFBSs in our predictions indicate direct relationships between transcription factors and genes, a favorable feature over the de novo methods. On the other hand, as noted by other

researchers (Rahmann et al. 2003), the TRANSFAC database is limited by the number and precision of the PWMs. The example of Drosophila in Section 3 implies that Module HMM depends on the catalog of known PWMs. Module HMM works best when all potential TFBSs of a genomic sequence are contained in the PWM pool. In addition to TRANSFAC, it is better to include more annotated PWMs in the pre-selection procedure.

Our HMM method searches for modules and motifs one sequence at a time. Therefore it does not require a good collection of correlated genes, and it works on data sets of any size. Moreover, because the emission probabilities for PWMs are obtained from the pre-selection procedure, the algorithm is computationally efficient and avoids the convergence difficulties common to Gibbs sampling approaches. When applying the Module HMM method in Section 3, we found that multiple runs of the sampling procedure produced similar results, with similar numbers of true positives and the same binding site locations. In contrast, both BioProspector and CisModule gave different results in different simulations.

The Module HMM method provides a joint probability of modules and within-module motifs, which models the dependence between motifs to a certain degree. The current version of the HMM can be extended to more complicated models that describe the order or precise spacing of the cooperating TFBSs. A direction for future work is to connect the results of promoter sequence analysis with gene expression data. It would be beneficial to identify modules of TFBSs that confer temporal and spatial expression patterns on a group of co-expressed genes.
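As a complement to the stochastic sampling procedure, the posterior probability of the module state in a two-state (background/module) chain can be computed exactly with the standard forward-backward recursions. A minimal sketch, with illustrative transition and emission values rather than the paper's fitted parameters:

```python
def posterior_module_prob(obs, trans, emit, init):
    """Posterior P(state = module | observations) for a 2-state HMM
    (state 0 = background, state 1 = module) via forward-backward."""
    n, S = len(obs), 2
    # forward pass: fwd[t][s] = P(obs[0..t], state_t = s)
    fwd = [[0.0] * S for _ in range(n)]
    for s in range(S):
        fwd[0][s] = init[s] * emit[s][obs[0]]
    for t in range(1, n):
        for s in range(S):
            fwd[t][s] = emit[s][obs[t]] * sum(
                fwd[t - 1][r] * trans[r][s] for r in range(S)
            )
    # backward pass: bwd[t][s] = P(obs[t+1..n-1] | state_t = s)
    bwd = [[1.0] * S for _ in range(n)]
    for t in range(n - 2, -1, -1):
        for s in range(S):
            bwd[t][s] = sum(
                trans[s][r] * emit[r][obs[t + 1]] * bwd[t + 1][r] for r in range(S)
            )
    z = sum(fwd[n - 1][s] for s in range(S))  # total sequence likelihood
    return [fwd[t][1] * bwd[t][1] / z for t in range(n)]

# Toy observations: symbol 1 marks a pre-selected TFBS hit, 0 is background
trans = [[0.9, 0.1], [0.2, 0.8]]               # sticky states favour contiguous modules
emit = [{0: 0.95, 1: 0.05}, {0: 0.5, 1: 0.5}]  # TFBS hits far likelier inside modules
probs = posterior_module_prob([0, 1, 1, 0, 0], trans, emit, [0.5, 0.5])
```

In this sketch the posterior rises over the run of TFBS hits and falls away from it, the same qualitative behaviour a sampling-based estimate converges to.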

Acknowledgement

We are grateful to James Fleet for helping collect the E2F sequence data and to Ker-Chau Li for helpful discussions on the work. We also thank the two anonymous referees for useful comments that improved the paper.

References

Bailey, T.L. and Noble, W.S. 2003. Searching for statistically significant regulatory modules. Bioinformatics 19 Suppl. 2, ii16-ii25.

Baum, L.E. 1972. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities 3, 1-8.

Crowley, E.M., Roeder, K. and Bina, M. 1997. A statistical model for locating regulatory regions in genomic DNA. J. Mol. Biol. 268.

Dempster, A.P., Laird, N.M. and Rubin, D.B. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society B 39, 1-38.

Durbin, R., Eddy, S., Krogh, A. and Mitchison, G. 1998. Biological Sequence Analysis. Cambridge University Press, Cambridge, United Kingdom.

Elkon, R., Linhart, C., Sharan, R., Shamir, R. and Shiloh, Y. 2003. Genome-wide in silico identification of transcriptional regulators controlling the cell cycle in human cells. Genome Res. 13.

Frith, M.C., Hansen, U. and Weng, Z. 2001. Detection of cis-element clusters in higher eukaryotic DNA. Bioinformatics 17.

Kel, A.E., Kel-Margoulis, O.V., Farnham, P.J., Bartley, S.M., Wingender, E. and Zhang, M.Q. 2001. Computer-assisted identification of cell cycle-related genes: New targets for E2F

transcription factors. J. Mol. Biol. 309.

Lawrence, C.E., Altschul, S.F., Boguski, M.S., Liu, J.S., Neuwald, A.F. and Wootton, J.C. 1993. Detecting subtle sequence signals: A Gibbs sampling strategy for multiple alignment. Science 262.

Liu, X., Brutlag, D.L. and Liu, J.S. 2001. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput.

Niessing, D., Rivera-Pomar, R., La Rosee, A., Hader, T., Schock, F., Purnell, B.A. and Jackle, H. 1997. A cascade of transcriptional control leading to axis determination in Drosophila. J. Cell. Physiol. 173.

Rahmann, S., Muller, T. and Vingron, M. 2003. On the power of profiles for transcription factor binding site detection. Statistical Applications in Genetics and Molecular Biology 2.

Rajewsky, N., Vergassola, M., Gaul, U. and Siggia, E.D. 2002. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics 3.

Roth, F.R., Hughes, J.D., Estep, P.E. and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned non-coding sequences clustered by whole-genome mRNA quantitation. Nature Biotechnology 16.

Sharan, R., Ovcharenko, I., Ben-Hur, A. and Karp, R.M. 2003. CREME: A framework for identifying cis-regulatory modules in human-mouse conserved segments. Bioinformatics 19, 1-9.

Sinha, S., van Nimwegen, E. and Siggia, E.D. 2003. A probabilistic method to detect regulatory modules. Bioinformatics 19 Suppl. 1, i292-i301.

Thompson, W., Palumbo, M.J., Wasserman, W.W., Liu, J.S. and Lawrence, C.E. 2004. Decoding human regulatory circuits. Genome Research 14.

Tommasi, S. and Pfeifer, G.P. 1995. In vivo structure of the human cdc2 promoter: Release of a p130-E2F-4 complex from sequences immediately upstream of the transcription initiation site coincides with induction of cdc2 expression. Mol. and Cell. Biol. 15.

Wasserman, W.W. and Fickett, J.W. 1998. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol. 278.

Wingender, E., Chen, X., Hehl, R., et al. 2000. TRANSFAC: an integrated system for gene expression regulation. Nucleic Acids Res. 28.

Yuh, C.H., Bolouri, H. and Davidson, E.H. 1998. Genomic cis-regulatory logic: Experimental and computational analysis of a sea urchin gene. Science 279.

Zhou, Q. and Wong, W.H. 2004. CisModule: De novo discovery of cis-regulatory modules by hierarchical mixture modeling. PNAS 101.


Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems

More information

A Phylogenetic Gibbs Recursive Sampler for Locating Transcription Factor Binding Sites

A Phylogenetic Gibbs Recursive Sampler for Locating Transcription Factor Binding Sites A for Locating Transcription Factor Binding Sites Sean P. Conlan 1 Lee Ann McCue 2 1,3 Thomas M. Smith 3 William Thompson 4 Charles E. Lawrence 4 1 Wadsworth Center, New York State Department of Health

More information

What DNA sequence tells us about gene regulation

What DNA sequence tells us about gene regulation What DNA sequence tells us about gene regulation The PhyloGibbs-MP approach Rahul Siddharthan The Institute of Mathematical Sciences, Chennai 600 113 http://www.imsc.res.in/ rsidd/ 03/11/2007 Rahul Siddharthan

More information

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM? Neyman-Pearson More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery Given a sample x 1, x 2,..., x n, from a distribution f(... #) with parameter #, want to test

More information

Alignment Algorithms. Alignment Algorithms

Alignment Algorithms. Alignment Algorithms Midterm Results Big improvement over scores from the previous two years. Since this class grade is based on the previous years curve, that means this class will get higher grades than the previous years.

More information

Variable Selection in Structured High-dimensional Covariate Spaces

Variable Selection in Structured High-dimensional Covariate Spaces Variable Selection in Structured High-dimensional Covariate Spaces Fan Li 1 Nancy Zhang 2 1 Department of Health Care Policy Harvard University 2 Department of Statistics Stanford University May 14 2007

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Pair Hidden Markov Models

Pair Hidden Markov Models Pair Hidden Markov Models Scribe: Rishi Bedi Lecturer: Serafim Batzoglou January 29, 2015 1 Recap of HMMs alphabet: Σ = {b 1,...b M } set of states: Q = {1,..., K} transition probabilities: A = [a ij ]

More information

Hidden Markov Models (I)

Hidden Markov Models (I) GLOBEX Bioinformatics (Summer 2015) Hidden Markov Models (I) a. The model b. The decoding: Viterbi algorithm Hidden Markov models A Markov chain of states At each state, there are a set of possible observables

More information

Network motifs in the transcriptional regulation network (of Escherichia coli):

Network motifs in the transcriptional regulation network (of Escherichia coli): Network motifs in the transcriptional regulation network (of Escherichia coli): Janne.Ravantti@Helsinki.Fi (disclaimer: IANASB) Contents: Transcription Networks (aka. The Very Boring Biology Part ) Network

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Phylogenetic Gibbs Recursive Sampler. Locating Transcription Factor Binding Sites

Phylogenetic Gibbs Recursive Sampler. Locating Transcription Factor Binding Sites A for Locating Transcription Factor Binding Sites Sean P. Conlan 1 Lee Ann McCue 2 1,3 Thomas M. Smith 3 William Thompson 4 Charles E. Lawrence 4 1 Wadsworth Center, New York State Department of Health

More information

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models.

Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Speech Recognition Lecture 8: Expectation-Maximization Algorithm, Hidden Markov Models. Mehryar Mohri Courant Institute and Google Research mohri@cims.nyu.com This Lecture Expectation-Maximization (EM)

More information

Written Exam 15 December Course name: Introduction to Systems Biology Course no

Written Exam 15 December Course name: Introduction to Systems Biology Course no Technical University of Denmark Written Exam 15 December 2008 Course name: Introduction to Systems Biology Course no. 27041 Aids allowed: Open book exam Provide your answers and calculations on separate

More information

Stephen Scott.

Stephen Scott. 1 / 27 sscott@cse.unl.edu 2 / 27 Useful for modeling/making predictions on sequential data E.g., biological sequences, text, series of sounds/spoken words Will return to graphical models that are generative

More information

E-value Estimation for Non-Local Alignment Scores

E-value Estimation for Non-Local Alignment Scores E-value Estimation for Non-Local Alignment Scores 1,2 1 Wadsworth Center, New York State Department of Health 2 Department of Computer Science, Rensselaer Polytechnic Institute April 13, 211 Janelia Farm

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

Rui Dilão NonLinear Dynamics Group, IST

Rui Dilão NonLinear Dynamics Group, IST 1st Conference on Computational Interdisciplinary Sciences (CCIS 2010) 23-27 August 2010, INPE, São José dos Campos, Brasil Modeling, Simulating and Calibrating Genetic Regulatory Networks: An Application

More information

Weighted Finite-State Transducers in Computational Biology

Weighted Finite-State Transducers in Computational Biology Weighted Finite-State Transducers in Computational Biology Mehryar Mohri Courant Institute of Mathematical Sciences mohri@cims.nyu.edu Joint work with Corinna Cortes (Google Research). 1 This Tutorial

More information

Assignments for lecture Bioinformatics III WS 03/04. Assignment 5, return until Dec 16, 2003, 11 am. Your name: Matrikelnummer: Fachrichtung:

Assignments for lecture Bioinformatics III WS 03/04. Assignment 5, return until Dec 16, 2003, 11 am. Your name: Matrikelnummer: Fachrichtung: Assignments for lecture Bioinformatics III WS 03/04 Assignment 5, return until Dec 16, 2003, 11 am Your name: Matrikelnummer: Fachrichtung: Please direct questions to: Jörg Niggemann, tel. 302-64167, email:

More information

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Hidden Markov Models and Their Applications in Biological Sequence Analysis Hidden Markov Models and Their Applications in Biological Sequence Analysis Byung-Jun Yoon Dept. of Electrical & Computer Engineering Texas A&M University, College Station, TX 77843-3128, USA Abstract

More information

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010 Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition

More information

Quantitative Bioinformatics

Quantitative Bioinformatics Chapter 9 Class Notes Signals in DNA 9.1. The Biological Problem: since proteins cannot read, how do they recognize nucleotides such as A, C, G, T? Although only approximate, proteins actually recognize

More information