Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis


Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): July 2008

Xiao-Yong Fang, Zhi-Gang Luo, and Zheng-Hua Wang
School of Computer Science, National University of Defense Technology, Changsha, China
xyfang@nudt.edu.cn; zgluo@nudt.edu.cn
Received July 2, 2007; revised May 3,

Abstract Stochastic context-free grammars (SCFGs) have been applied to predicting RNA secondary structure. Prediction can be improved by incorporating comparative sequence analysis. However, most existing SCFG-based methods lack explicit phylogenic analysis of homologous RNA sequences, which is probably why they fall short in practical applications. We therefore present a new SCFG-based method that integrates phylogenic analysis with a newly defined profile SCFG. The method can be summarized as follows: 1) we define a new profile SCFG, $M$, to describe the consensus secondary structure of a multiple RNA sequence alignment; 2) we introduce two distinct hidden Markov models, $\lambda'$ and $\lambda''$, to perform phylogenic analysis of homologous RNA sequences, where $\lambda'$ is for non-structural regions of the sequence and $\lambda''$ is for structural regions; 3) we merge $\lambda'$ and $\lambda''$ into $M$ to obtain a combined model for predicting RNA secondary structure. We tested our method on data sets constructed from the Rfam database. The sensitivity and specificity of our method are higher than those of Pfold.

Keywords RNA secondary structure, stochastic context-free grammar, phylogenic analysis

1 Introduction

In recent years, non-coding RNAs (ncRNAs) have attracted increasing interest as a wide variety of functions associated with them have been discovered [1–3].
The function of an RNA molecule is principally determined by its (secondary) structure. To date, experimental approaches remain the most reliable methods for secondary structure determination [4]. Unfortunately, their difficulty and expense are often prohibitive, especially for high-throughput applications. For this reason, computational prediction provides an attractive alternative to empirical discovery of RNA secondary structure [5]. Most algorithms for RNA secondary structure prediction are based on Minimum Free Energy (MFE) [6,7]. However, there are several independent reasons why the accuracy of MFE structure prediction is limited in practice [5]. Recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for predicting RNA secondary structure [8–15]. The advantage of methods of this kind is that they are more readily extended to include other sources of statistical information that constrain a structure prediction. In general, the most powerful source of information for RNA structure prediction is probably comparative sequence analysis [16], which uses evolutionary information. Hence, there is a need for automated approaches that combine evolutionary information from comparative sequence analysis with probabilistic modeling based on stochastic context-free grammars. However, most existing SCFG-based methods, with the exception of Pfold [13,14], do not explicitly take the evolutionary information of RNA sequences into account. Although Pfold has achieved better results by using its so-called phylo-SCFGs, it omits much evolutionary information, such as deletion and insertion, in order to simplify the prediction algorithm. This is probably why Pfold is not ideal for some low-quality alignments, especially alignments containing many gaps. In this paper, we present a new combined method for secondary structure prediction that integrates phylogenic analysis with SCFGs.
Here, more information about the evolutionary process of RNA sequences is considered when computing the optimal secondary structure. We define a new profile SCFG by introducing some new variables and production rules. We perform phylogenic analysis of RNA sequences using two distinct evolutionary substitution models. Here, the theory of HMMs [15] is employed to take deletion and insertion into account. We devise a combined model for structure prediction by integrating the evolutionary models with the new profile SCFG. We tested our method on data sets of known RNA sequence alignments downloaded from the Rfam database [17]. The comparison with Pfold shows that our method predicts RNA secondary structure with higher accuracy than Pfold.

Regular Paper. Supported by the National Natural Science Foundation of China under Grant No

2 Methods

2.1 Modeling RNA Secondary Structure Using Profile SCFGs

An SCFG can be viewed as a device capable of both generating and parsing strings [15]. The SCFG described here takes a gapped sequence alignment as input and is therefore called a profile SCFG. Here, we introduce the profile SCFG from the parsing point of view. The profile SCFG consists of a set of variables, some terminal and some non-terminal. Terminals correspond to columns of the input alignment and non-terminals correspond to the states of the grammar. The non-terminals are rewritten according to a set of production rules. Each production rule specifies a single non-terminal and a set of columns (if any) that it should be changed to. Successively applying the production rules of the grammar until all columns of the input alignment have been described is called parsing, which defines a parse tree.

Most literature on SCFGs assumes the grammar to be in Chomsky normal form [15] for the algorithms to be used. However, the profile SCFG presented here decomposes productions into two independent parts: non-terminal transitions and terminal emissions. The profile SCFG can thus be defined by the four-tuple $M = \{W, T, Al, E\}$. Here, $M$ is the profile SCFG; $W$ is the set of non-terminals; $T$ is the matrix of transition distributions; $Al$ is the set of terminals; and $E$ is the matrix of emission distributions. We define them as follows.

$W = \{\mathrm{start}, \mathrm{bifurcation}, \mathrm{single}, \mathrm{pair}, \mathrm{end}\}$.
$T = [t(w, w')]$, where $w, w' \in W$, and $t(w, w')$ is the transition probability from $w$ to $w'$.
$Al = \{A, C, G, U, -\}^n$, where $n$ is the number of sequences in the alignment and $-$ symbolizes the gap.
$E = [e_w]$, where $w \in W$. If $w = \mathrm{pair}$, then $e_w = e(\beta, \beta')$. Here, $\beta, \beta' \in Al$, and $(\beta, \beta')$ is a column-pair, i.e., two paired columns (see Fig.1(a)). If $w = \mathrm{single}$, then $e_w = e(\gamma)$. Here, $\gamma \in Al$, and $(\gamma)$ is a single column that does not pair with any other column (see Fig.1(a)). The non-terminals start, bifurcation and end do not produce any column.

The production rules can be categorized into three classes: the pair rules, the single rules and the others. We define them as follows.
The pair rules: $\mathrm{pair} \to \beta\, w'\, \beta'$ (probability: $t(\mathrm{pair}, w')\, e(\beta, \beta')$).
The single rules: $\mathrm{single} \to \gamma\, w'$ (probability: $t(\mathrm{single}, w')\, e(\gamma)$), $\mathrm{single} \to w'\, \gamma$ (probability: $t(\mathrm{single}, w')\, e(\gamma)$).
The others: $\mathrm{start} \to w$ (probability: $t(\mathrm{start}, w)$, $w \in \{\mathrm{pair}, \mathrm{single}\}$), $\mathrm{bifurcation} \to \mathrm{start}\ \mathrm{start}$ (probability: 1).

Fig.1. Modeling RNA secondary structure using the defined profile SCFG. (a) RNA alignment of three sequences. (b) Consensus structure for this alignment. For simplicity, only the first sequence is shown. (c) Parsing by the profile SCFG for this structure. (d) Parse tree for this parsing.

Here start starts a parsing, end terminates a parsing and bifurcation is used for splitting into multiple stems and multi-branch loops. The non-terminal single produces a single column in non-structural regions of the alignment. The non-terminal pair produces a column-pair in structural regions of the alignment.

The consensus secondary structure of an RNA alignment can be modeled by the profile SCFG defined above. In Fig.1, Fig.1(a) is a three-way alignment; Fig.1(b) is the consensus secondary structure, which contains two stems as shown in Fig.1(a); Fig.1(c) is the parsing of the structure shown in Fig.1(b); and Fig.1(d) is the parse tree for the parsing. Note that an RNA alignment may have more than one consensus structure, each of which corresponds to one parsing by the profile SCFG. For simplicity, only one of the possible structures is presented here. The SCFG presented here differs substantially from other SCFGs: when a production rule is applied, it generates a single column or a column-pair rather than a single base or a base-pair. By assigning probabilities to each production rule, each parse tree can be assigned an overall probability, which is the product of the probabilities of all production rules used in the tree. Therefore, the RNA secondary structure can be evaluated by the parse tree given by the SCFG. The probability of an RNA secondary structure can be computed by:

$$P(\sigma \mid M) = P(\mathit{Tree} \mid M) = \prod_{i=1}^{l_1} t_i(\mathrm{pair}, w)\, e(\beta, \beta') \prod_{j=1}^{l_2} t_j(\mathrm{single}, w')\, e(\gamma) \prod_{k=1}^{l_3} t_k, \quad w, w' \in W. \tag{1}$$

Here, $\mathit{Tree}$ is the parse tree that uses $l_1$ pair rules, $l_2$ single rules and $l_3$ others; $t_k$ is the probability of a rule from the others; $\sigma$ is the secondary structure that can be parsed by $\mathit{Tree}$ given the profile SCFG $M$.

2.2 Phylogenic Analysis of Homologous RNA Sequences Using HMMs

Evolutionary substitution models have long been used for phylogenic inference and for the study of molecular evolution. Here, we first present two novel evolutionary models based on the theory of HMMs, and then use them to perform phylogenic analysis. Our method takes a phylogenic tree relating the sequences, as shown in Fig.2(b), as input, and assumes that all the sequences in the alignment evolve independently. Suppose $D$ denotes an RNA alignment containing $n$ sequences of length $N$; $\alpha_i$ denotes the $i$-th column of $D$; $\alpha^j$ denotes the $j$-th row (i.e., the $j$-th sequence) of $D$; $\alpha_i^j$ denotes the base at site $(i, j)$ of $D$. Here, $1 \le i \le N$ and $1 \le j \le n$.
Suppose $Tr$ is the phylogenic tree inferred from the $n$ sequences and $R$ is the set of edges on the tree. Then $Tr$ is a tree with $n$ leaves, with sequence $j$ at leaf $j$. In Fig.2, Fig.2(a) is an RNA alignment of three sequences, and Fig.2(b) is one of the possible phylogenic trees inferred from the sequences. Suppose the nodes on $Tr$ are numbered from 1 to $2n-1$ and the last is the root node. Then the probability of $\alpha_i$ given the tree $Tr$, $P(\alpha_i \mid Tr, R)$, can be evaluated by:

$$P(\alpha_i \mid Tr, R) = P(\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^n \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} q_{2n-1} \prod_{k=1}^{2n-2} P(a_k \mid a_{F(k)}, r_k). \tag{2}$$

Fig.2. (a) RNA alignment of three sequences. (b) Phylogenic tree with three leaves, with sequence $j$ at leaf $j$ ($1 \le j \le 3$). For simplicity, only the evolutionary substitution process for one of the single columns (marked by the box in (a)) is shown.

Here, $R = \{r_1, r_2, \ldots, r_{2n-2}\}$; $q_{2n-1}$ is the initial probability of the base at the root node; $F(k)$ is the immediate ancestor node of $k$ ($1 \le k \le 2n-2$); $a_k$ ($1 \le k \le 2n-1$) denotes the base at node $k$; $P(a_k \mid a_{F(k)}, r_k)$ denotes the probability of $a_k$ arising from $a_{F(k)}$ over $r_k$ ($1 \le k \le 2n-2$). In particular, $a_k = \alpha_i^k$ when $1 \le k \le n$, since leaf $k$ carries sequence $k$. The sum is over all possible assignments of bases $a_k$ to non-leaf nodes $k$ ($n+1 \le k \le 2n-1$). However, (2) is only suitable for ungapped alignments because it does not take deletions and insertions into account. To solve this problem, we model the evolutionary substitution process using two distinct HMMs.

We define the evolutionary model for single columns with the five-tuple $\lambda' = \{S, V, \pi, T', E'\}$. Here, $\lambda'$ is the HMM; $S$ is the set of states; $V$ is the set of observed characters; $\pi$ is the initial probability vector of the states; $T'$ is the matrix of transition probabilities; $E'$ is the matrix of emission probabilities. We define them as follows.

$S = \{\mathrm{match}, \mathrm{delete}, \mathrm{insert}\}$, where match symbolizes the matching state, delete symbolizes the deletion state and insert symbolizes the insertion state.
$V = \{A, C, G, U, -\}$.
$\pi = \{\pi_{\mathrm{match}}, \pi_{\mathrm{delete}}, \pi_{\mathrm{insert}}\}$, where $\pi_x$ is the initial probability of $x$ and $x \in S$.

$T' = [t'(x, y)]$, where $x, y \in S$, and $t'(x, y)$ is the transition probability from $x$ to $y$.
$E'$ is the matrix of emission probabilities. Here, however, we substitute the mutation probabilities of single bases for the emission probabilities. Let $e'_{x,y}(b, b')$ denote the probability of $b$ evolving to $b'$ when $x$ changes to $y$. Here $x, y \in S$ and $b, b' \in V$. The states match, delete and insert cover the following three cases:
$b \neq -$ and $b' \neq -$ implies $x = \mathrm{match}$.
$b \neq -$ and $b' = -$ implies $x = \mathrm{delete}$.
$b = -$ and $b' \neq -$ implies $x = \mathrm{insert}$.

For the leaf node $j$ ($1 \le j \le n$), let $A_j$ denote the set of ancestor nodes of $j$ and $R_j$ denote the set of corresponding edges. Suppose the node $j$ has $m(j)$ ancestor nodes. Thus $A_j = \{2n-1, A(1), A(2), \ldots, A(m(j)-1)\}$. Here, $2n-1$ is the root node and $A(k)$ is the immediate ancestor node of $A(k+1)$ for $1 \le k \le m(j)-2$. For the leaf node $j$ ($1 \le j \le n$), we set $f$ as the recursion variable and initialize it with:

$$f_1(y) = q_{2n-1}\, \pi_{\mathrm{match}}\, t'(\mathrm{match}, y)\, e'_{\mathrm{match},y}(a_{2n-1}, a_{A(1)}), \quad y \in S. \tag{3}$$

We define $\pi_{\mathrm{match}} = 1$ because the root node is always in the matching state. The recurrence is:

$$f_k(y) = \sum_{x \in S} f_{k-1}(x)\, t'(x, y)\, e'_{x,y}(a_{A(k-1)}, a_{A(k)}), \quad x, y \in S;\ 2 \le k \le m(j)-1. \tag{4}$$

The probability of $\alpha_i^j$ given $Tr$ and $R_j$ is thus computed by:

$$P(\alpha_i^j \mid Tr, R_j) = \sum_{y \in S} \sum_{x \in S} f_{m(j)-1}(x)\, t'(x, y)\, e'_{x,y}(a_{A(m(j)-1)}, \alpha_i^j), \quad 1 \le i \le N;\ 1 \le j \le n. \tag{5}$$

For single columns (i.e., non-structural regions), (2) is thus revised to:

$$P(\alpha_i \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} \prod_{j=1}^{n} P(\alpha_i^j \mid Tr, R_j), \quad 1 \le i \le N. \tag{6}$$

Similarly, we define the evolutionary model for column-pairs with another five-tuple $\lambda'' = \{S, V, \pi, T'', E''\}$. We define them as follows.

$S = \{\mathrm{match}, \mathrm{delete}, \mathrm{insert}\}$.
$V = \{A, C, G, U, -\} \times \{A, C, G, U, -\}$.
$\pi = \{\pi_{\mathrm{match}}, \pi_{\mathrm{delete}}, \pi_{\mathrm{insert}}\}$, where $\pi_x$ is the initial probability of $x$ and $x \in S$.
$T'' = [t''(x, y)]$, where $x, y \in S$ and $t''(x, y)$ is the transition probability from $x$ to $y$.
$E''$ is the matrix of emission probabilities. Here, however, we substitute the mutation probabilities of base-pairs for the emission probabilities. Let $e''_{x,y}(b_1 b_1', b_2 b_2')$ denote the probability of $b_1 b_1'$ evolving to $b_2 b_2'$ when $x$ changes to $y$. Here $x, y \in S$ and $b_1 b_1', b_2 b_2' \in V$. The states match, delete and insert cover the following three cases.
If none of the characters in $b_1 b_1'$ changes to the character $-$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{match}$.
If at least one of the characters in $b_1 b_1'$ changes to the character $-$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{delete}$.
If at least one $-$ in $b_1 b_1'$ changes to one of $\{A, C, G, U\}$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{insert}$.

Let $\alpha_i \alpha_{i'}$ denote one of the column-pairs of $D$ and $\alpha_i^j \alpha_{i'}^j$ denote the base-pair from $\alpha_i \alpha_{i'}$ in the $j$-th sequence (i.e., at the leaf node $j$) of $D$. Here, $1 \le j \le n$ and $1 \le i, i' \le N$. Also, let $a_k a_k'$ ($1 \le k \le 2n-1$) denote the base-pair at the node $k$. For the leaf node $j$ ($1 \le j \le n$), we set $f$ as the recursion variable and initialize it with:

$$f_1(y) = q_{2n-1}\, \pi_{\mathrm{match}}\, t''(\mathrm{match}, y)\, e''_{\mathrm{match},y}(a_{2n-1} a_{2n-1}', a_{A(1)} a_{A(1)}'), \quad y \in S. \tag{7}$$

Here, $q_{2n-1}$ is the initial probability of the base-pair at the root node, and again $\pi_{\mathrm{match}} = 1$. The recurrence is:

$$f_k(y) = \sum_{x \in S} f_{k-1}(x)\, t''(x, y)\, e''_{x,y}(a_{A(k-1)} a_{A(k-1)}', a_{A(k)} a_{A(k)}'), \quad x, y \in S;\ 2 \le k \le m(j)-1. \tag{8}$$

The probability of $\alpha_i^j \alpha_{i'}^j$ given $Tr$ and $R_j$ is thus computed by:

$$P(\alpha_i^j \alpha_{i'}^j \mid Tr, R_j) = \sum_{y \in S} \sum_{x \in S} f_{m(j)-1}(x)\, t''(x, y)\, e''_{x,y}(a_{A(m(j)-1)} a_{A(m(j)-1)}', \alpha_i^j \alpha_{i'}^j), \quad 1 \le i, i' \le N;\ 1 \le j \le n. \tag{9}$$

For column-pairs (i.e., structural regions), (2) is thus revised to:

$$P(\alpha_i \alpha_{i'} \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} \prod_{j=1}^{n} P(\alpha_i^j \alpha_{i'}^j \mid Tr, R_j), \quad 1 \le i, i' \le N. \tag{10}$$
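The recursion in (3)–(5) for one leaf can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `path_probability`, the representation of a root-to-leaf path as a list of symbols, and the toy transition and emission values are all assumptions made for the example. The same code applies to (7)–(9) if each symbol is a base-pair and the parameters are those of the column-pair model.

```python
STATES = ("match", "delete", "insert")


def path_probability(path, q_root, trans, emit, pi_match=1.0):
    """Probability of the leaf symbol along one root-to-leaf path.

    path   : [a_root, a_A(1), ..., a_A(m-1), leaf_symbol]; needs m >= 2
    q_root : initial probability of the symbol at the root
    trans  : trans[x][y], transition probability from state x to state y
    emit   : emit(x, y, b, b_next), mutation term for b -> b_next
    """
    if len(path) < 3:
        raise ValueError("need the root, at least one further ancestor, and the leaf")

    # Initialization as in (3): the root is always in the match state.
    f = {
        y: q_root * pi_match * trans["match"][y] * emit("match", y, path[0], path[1])
        for y in STATES
    }

    # Recurrence as in (4), for k = 2 .. m-1 (path[k] is the symbol at A(k)).
    for k in range(2, len(path) - 1):
        f = {
            y: sum(f[x] * trans[x][y] * emit(x, y, path[k - 1], path[k]) for x in STATES)
            for y in STATES
        }

    # Termination as in (5): last ancestor to the observed leaf symbol.
    return sum(
        f[x] * trans[x][y] * emit(x, y, path[-2], path[-1])
        for y in STATES
        for x in STATES
    )


def toy_emit(x, y, b, b_next):
    """Hypothetical mutation term that ignores the states (illustration only)."""
    return 0.6 if b == b_next else 0.1


# Hypothetical transition matrix: every state moves to match with probability 0.8.
toy_trans = {x: {"match": 0.8, "delete": 0.1, "insert": 0.1} for x in STATES}

p = path_probability(["A", "A", "-", "G"], q_root=0.25, trans=toy_trans, emit=toy_emit)
```

Because each step only needs the previous values of $f$, the cost is linear in the number of ancestors of the leaf and quadratic in the number of states.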

2.3 Secondary Structure Prediction Using Combined Model

We construct a combined model by revising $M$ using $\lambda'$ and $\lambda''$. In more detail, we revise (1) to:

$$P(\sigma \mid M, \lambda', \lambda'') = P(\mathit{Tree} \mid M, \lambda', \lambda'') = \prod_{i=1}^{l_1} t_i(\mathrm{pair}, w)\, e(\beta, \beta')\, P(\beta\beta' \mid Tr, R) \prod_{j=1}^{l_2} t_j(\mathrm{single}, w')\, e(\gamma)\, P(\gamma \mid Tr, R) \prod_{k=1}^{l_3} t_k, \quad w, w' \in W. \tag{11}$$

Here, $P(\beta\beta' \mid Tr, R)$ is computed by (10) and $P(\gamma \mid Tr, R)$ is computed by (6). Let $\sigma_{\max}$ denote the optimal secondary structure for the alignment $D$; then:

$$\sigma_{\max} = \arg\max_{\sigma} P(\sigma \mid M, \lambda', \lambda''). \tag{12}$$

$\sigma_{\max}$ is found using the CYK algorithm [15] on the profile SCFG $M$.

3 Results

3.1 Parameters Estimation

The free parameters of the combined model are the parameters of $M$, $\lambda'$ and $\lambda''$. To estimate them, we construct the training data sets using the known consensus secondary structures of ncRNA families taken from the Rfam database. Here, we build the training data set together with the testing data set and report them in Table 1. Specifically, Rfam contains seed alignments of multiple RNA sequences, and a consensus secondary structure for each alignment, either taken from a previously published study in the literature or predicted using automated covariance-based methods. To establish gold-standard data for training and testing, we first removed all seed alignments with only predicted secondary structures, retaining the 150 families with secondary structures from the literature. The end result was a set of 150 independent examples, each taken from a different RNA family.

Table 1. Distribution of the Training and Testing Data Sets (columns: Id (%), # Family, # Sequence, # Three-way, # Four-way, # Five-way)
Note: Id: percentage identity; # Family: number of RNA families; # Sequence: number of RNA sequences; # Three-way: number of three-way alignments; # Four-way: number of four-way alignments; # Five-way: number of five-way alignments.
For each selected seed alignment, we randomly extract three, four and five different sequences from the ncRNA family. For example, from a family of 5 sequences we can obtain 10 (i.e., $C_5^3$) three-way alignments, 5 (i.e., $C_5^4$) four-way alignments, and 1 (i.e., $C_5^5$) five-way alignment in total. In this way, we construct one set of three-way alignments, one set of four-way alignments and one set of five-way alignments. These data sets are used for training the models $M$, $\lambda'$ and $\lambda''$. The sequence identity of the alignments ranges from 40% to 99%. The distribution of the training and testing data sets is shown in Table 1. For the data sets shown in Table 1, the alignments from 100 families are used to build the training data set and the alignments from the other 50 families are used to build the testing data set.

As for the parameters of $M$, the set of transition distributions $T$ and the set of emission distributions $E$ are estimated. For estimating the probabilities of $T$, the inside-outside algorithm [15] (an expectation maximization procedure) is performed on the training set of secondary structures. In more detail, the number of times each rule is used is counted over the training set, and the probability of each rule is then computed by dividing the number of times it is used by the total number of times all rules are used. For estimating the $e(\gamma)$ probabilities of $E$, the single-column frequencies are estimated from counts of the columns in the non-structural regions of the alignments in the training set. Thus, the overall single-column frequencies are determined, and hence the probability of each kind of single column. The $e(\beta, \beta')$ probabilities of $E$ are estimated in the same way as $e(\gamma)$. The only difference is that column-pairs, rather than single columns, are counted from the structural regions of the alignments.

As for the parameters of $\lambda'$, the set of transition distributions $T'$ and the set of emission distributions $E'$ are estimated.
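As a rough illustration of the data-set construction and the count-based estimates described for M above, the following Python sketch enumerates the k-way sub-alignments of a family and turns raw counts into relative frequencies. The helper names (`sub_alignments`, `single_columns`, `relative_frequencies`) and the five-sequence toy family are hypothetical; the sketch covers only the simple counting step, not the inside-outside or forward-backward procedures.

```python
from collections import Counter
from itertools import combinations


def sub_alignments(rows, k):
    """All k-row sub-alignments of a list of aligned sequences."""
    return [list(chosen) for chosen in combinations(rows, k)]


def single_columns(rows, unpaired_positions):
    """Columns (as tuples, one symbol per row) at the given unpaired positions."""
    return [tuple(row[i] for row in rows) for i in unpaired_positions]


def relative_frequencies(items):
    """Probability of each distinct item, estimated as count / total count."""
    counts = Counter(items)
    total = sum(counts.values())
    return {item: n / total for item, n in counts.items()}


# Toy family of five aligned sequences (made up for illustration).
family = ["GCAU-GC", "GCAUAGC", "GGAU-CC", "GCAA-GC", "GCGU-GC"]

n_three_way = len(sub_alignments(family, 3))  # 10
n_four_way = len(sub_alignments(family, 4))   # 5
n_five_way = len(sub_alignments(family, 5))   # 1

# Estimate e(gamma) for one three-way alignment, assuming that positions 2-4
# are the non-structural (unpaired) columns of this toy example.
three_way = sub_alignments(family, 3)[0]
e_gamma = relative_frequencies(single_columns(three_way, [2, 3, 4]))
```

In practice the counts would be pooled over all training alignments before normalizing, and a sub-alignment may contain columns consisting only of gaps, which a real pipeline would need to handle.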
For estimating the probabilities of $T'$, the forward-backward algorithm [15] is performed on the evolutionary process of each sequence in the alignments of the training set. For the probabilities of $E'$, recall that we substitute the mutation probabilities of bases for the emission probabilities. To estimate mutation rates, a number of sequences from the training set are paired. The single-base positions in these sequence pairs are examined and all differences between the sequences are counted. For example, if a given position has base $b$ in one sequence and base $b'$ in the other, both the counter for $b$ changing to $b'$ and the counter for $b'$ changing to $b$ are incremented.

As for the parameters of $\lambda''$, the set of transition distributions $T''$ is estimated in the same way as $T'$, and the set of emission distributions $E''$ in the same way as $E'$. The only difference is that base-pairs, rather than single bases, are examined and counted from the sequence pairs mentioned above.

3.2 Tests for Multiple Alignments of RNA Sequences

As mentioned in Subsection 3.1, the testing data set is also constructed from the Rfam database. We first choose the 50 ncRNA families in Table 1 that are not included in the training set, and then build sets of alignments as we do for the training set. The sequence identity of the alignments again ranges from 40% to 99%. The testing data sets again consist of one set of three-way alignments, one set of four-way alignments, and one set of five-way alignments. These alignments are also used as the input for Pfold. To test our method and compare it with Pfold, we evaluate the predictions of each method using the sensitivity and specificity of the predicted base-pairs. We compute the sensitivity as the number of true positives divided by the sum of true positives and false negatives, and the specificity as the number of true positives divided by the sum of true positives and false positives.

Table 2. Sensitivity and Specificity on Data Set of Three-Way Alignments (columns: Id (%), Se (%), Sp (%), Se Pfold (%), Sp Pfold (%))
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

The results of our method are compared with those of Pfold, and are reported in Tables 2–4. In these tables, the second and third columns are the results of our method, followed by the results of Pfold.
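The two measures defined above can be computed directly from sets of base-pairs. The sketch below is only an illustration: the function name `pair_metrics` and the representation of a structure as a set of `(i, j)` position pairs are assumptions, not something the paper prescribes. Note that specificity as defined here (true positives over true positives plus false positives) is the quantity often called positive predictive value.

```python
def pair_metrics(predicted, reference):
    """Sensitivity and specificity of predicted base-pairs.

    Both arguments are iterables of (i, j) position pairs with i < j.
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)   # pairs found in both structures
    false_pos = len(predicted - reference)  # predicted but not in the reference
    false_neg = len(reference - predicted)  # in the reference but missed

    sensitivity = true_pos / (true_pos + false_neg) if reference else 0.0
    specificity = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, specificity


# Toy example: three of four reference pairs recovered, one spurious pair.
reference_pairs = {(1, 10), (2, 9), (3, 8), (5, 6)}
predicted_pairs = {(1, 10), (2, 9), (3, 8), (4, 7)}
se, sp = pair_metrics(predicted_pairs, reference_pairs)
```

Returning 0.0 when a set is empty is a convention chosen for this sketch; other conventions (for example, leaving the value undefined) are equally defensible.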
As shown in the tables, our method exhibits both higher sensitivity and higher specificity than Pfold, especially for alignments with lower sequence identity. One significant finding is that our method exhibits much higher accuracy than Pfold when many gaps are introduced into the input alignment. Another valuable finding is that our method exhibits higher accuracy when more sequences are included. The best accuracy of our method (sensitivity 83.55%, specificity 81.68%) is achieved in tests on five-way alignments with 50–60% sequence identity.

Table 3. Sensitivity and Specificity on Data Set of Four-Way Alignments (columns: Id (%), Se (%), Sp (%), Se Pfold (%), Sp Pfold (%))
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

Table 4. Sensitivity and Specificity on Data Set of Five-Way Alignments (columns: Id (%), Se (%), Sp (%), Se Pfold (%), Sp Pfold (%))
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

3.3 Comparison with the Model Without delete and insert

Here, we compare the model presented in Subsection 2.2 with a version of the model in which the delete and insert states are disabled. For the model without delete and insert, we still use (2) to compute the probability of a single column from the non-structural regions of the alignment, and we use the following equation to compute the probability of a column-pair from the structural regions of the alignment:

$$P(\alpha_i \alpha_{i'} \mid Tr, R) = P(\alpha_i^1 \alpha_{i'}^1, \alpha_i^2 \alpha_{i'}^2, \ldots, \alpha_i^n \alpha_{i'}^n \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} q_{2n-1} \prod_{k=1}^{2n-2} P(a_k a_k' \mid a_{F(k)} a_{F(k)}', r_k). \tag{13}$$
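Sums of the form (2) and (13) range over all assignments to the internal nodes, but they need not be enumerated explicitly. The paper does not say how the sum is evaluated; one standard option is a post-order (pruning) recursion over the tree, commonly attributed to Felsenstein, sketched below in Python. The function name, the `children` dictionary, the three-leaf topology and the substitution probabilities are all assumptions made for this example. For (13), the same code applies with an alphabet of base-pairs.

```python
def column_likelihood(children, root, leaf_symbol, alphabet, root_prior, edge_prob):
    """Evaluate a sum of the form (2) by a post-order (pruning) recursion.

    children    : dict mapping each internal node to a list of its child nodes
    leaf_symbol : dict mapping each leaf to its observed symbol
    root_prior  : dict mapping each symbol to its initial probability at the root
    edge_prob   : edge_prob(child_symbol, parent_symbol, child_node)
    """

    def partial(node):
        # partial(node)[s] = P(observed leaves below node | symbol s at node)
        if node not in children:  # leaf: only the observed symbol is possible
            return {s: 1.0 if s == leaf_symbol[node] else 0.0 for s in alphabet}
        result = {s: 1.0 for s in alphabet}
        for child in children[node]:
            below = partial(child)
            for s in alphabet:
                result[s] *= sum(edge_prob(c, s, child) * below[c] for c in alphabet)
        return result

    top = partial(root)
    return sum(root_prior[s] * top[s] for s in alphabet)


def toy_edge_prob(child_symbol, parent_symbol, child_node):
    """Hypothetical substitution probabilities, identical on every edge."""
    return 0.7 if child_symbol == parent_symbol else 0.1


# Assumed topology for n = 3: leaves 1..3, internal node 4, root 5.
children = {4: [1, 2], 5: [4, 3]}
leaf_symbol = {1: "A", 2: "A", 3: "A"}
alphabet = "ACGU"
root_prior = {s: 0.25 for s in alphabet}
likelihood = column_likelihood(children, 5, leaf_symbol, alphabet, root_prior, toy_edge_prob)
```

The recursion visits each edge once and does work quadratic in the alphabet size per edge, instead of enumerating every combination of internal-node symbols.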

In (13), $\alpha_i \alpha_{i'}$ still denotes one of the column-pairs of $D$, and $a_k a_k'$ ($1 \le k \le 2n-1$) denotes the base-pair at the node $k$. $P(a_k a_k' \mid a_{F(k)} a_{F(k)}', r_k)$ denotes the probability of $a_k a_k'$ arising from $a_{F(k)} a_{F(k)}'$ over $r_k$ ($1 \le k \le 2n-2$). We still use (11) and (12) to compute the optimal secondary structure for the alignment $D$, but $P(\beta\beta' \mid Tr, R)$ in (11) is computed by (13) and $P(\gamma \mid Tr, R)$ is computed by (2). We selected some five-way alignments containing more gaps from the testing data sets built in Subsection 3.2 and tested the reduced model on them. The results are compared with those of the full model described in Subsection 2.3, and are reported in Table 5. The second and third columns in the table are the results for the full model, followed by the results for the model without delete and insert.

Table 5. Testing for the Full Model and the Model Without delete and insert on Five-Way Alignments (columns: Id (%), Se Full (%), Sp Full (%), Se (%), Sp (%))
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

4 Discussion

In this paper, we devise, implement and evaluate a new SCFG-based method for predicting RNA secondary structure by introducing a more detailed phylogenic analysis of homologous sequences. Our method improves the prediction by modeling RNA secondary structures using the newly defined profile SCFG and by performing phylogenic analysis of RNA sequences using more evolutionary information, including deletion and insertion. The new profile SCFG distinguishes our method from other methods. The phylogenic analysis of RNA sequences using two distinct HMMs makes our method robust. The fact that our method takes a gapped RNA alignment as input makes it well suited to practical application. Despite the limited amount of data, we have shown in the experiments that our method predicts RNA secondary structure with better performance than Pfold.
Our method exhibits higher accuracy than Pfold, especially for alignments with more sequences, more gaps, and lower sequence identity but higher structure identity. This is the expected behavior, because our method takes much information about the evolutionary process into account, while Pfold uses a simplified evolutionary model to analyze the sequences in the alignment. On the other hand, both methods exhibit similar accuracy for alignments with very high sequence identity (> 90%), because little evolutionary information can be obtained from such alignments. As a result, our method loses some of its advantage over Pfold in that regime. As for the comparison of the full model with the model without delete and insert (see Table 5), we conclude that the states delete and insert can help improve the prediction of RNA secondary structure when there are more sequences and more gaps. More benchmarking is needed to support a more precise conclusion.

There are some differences between the method presented here and Pfold. First, two distinct HMMs, $\lambda'$ and $\lambda''$, are proposed for performing phylogenic analysis of the single columns and column-pairs of the RNA sequence alignment. Both $\lambda'$ and $\lambda''$ include the states match, delete and insert, whereas Pfold only has the state match (from the HMM point of view) because it treats the gap as a fifth character in addition to {A, C, G, U}. Hence, much evolutionary information such as deletion and insertion is omitted by Pfold. Second, the HMMs presented here model the evolutionary substitution processes of single columns and column-pairs separately, whereas Pfold does not take the difference between single columns and column-pairs into account. Third, we define and train a new profile SCFG to model RNA secondary structure. In particular, we introduce the concepts of single column and column-pair.
Finally, we merge the new SCFG and the two HMMs into a combined model to predict RNA secondary structure. The method presented here can thus be regarded as an improvement of Pfold.

There are several ways in which our method could be improved. First, the fact that our method takes a fixed alignment as input makes it vulnerable to alignment errors. Unfortunately, it is difficult to construct an accurate alignment of RNA sequences without knowing the secondary structures. One way to solve this problem is to consider some structural information when aligning RNA sequences. We have presented a fast and practical method for detecting and assessing secondary structure in RNA alignments in [18]. In future work, we will apply it to constructing high-quality alignments, which can further be used to improve the prediction of RNA secondary structure. Second, our method takes a phylogenic tree relating the sequences as input. If the tree is not given, it must be estimated from the model. Estimating tree topology can be done by an exhaustive search, a branch and bound method or a heuristic method. The choice will depend strongly on the number of sequences in the alignment, given how fast the number of trees grows with the number of sequences. Third, one may be concerned about the time complexity of the method, especially for modeling the secondary structure using the new profile SCFG. To address this, new parallel algorithms for modeling and predicting RNA secondary structures could be devised, but further study is needed to ensure that the accuracy of the method is maintained. Finally, we hope that our method will find use in studies in both biology and bioinformatics.

References

[1] Storz G. An expanding universe of noncoding RNAs. Science, 2002, 296(5571):
[2] Eddy S R. Non-coding RNA genes and modern RNA world. Nat. Rev. Genet., 2001, 2(12):
[3] Huttenhofer A, Schattner P, Polacek N. Non-coding RNAs: Hope or hype? TRENDS in Genetics, 2005, 21(5):
[4] Furtig B et al. NMR spectroscopy of RNA. Chembiochem, 2003, 4(10):
[5] Gardner P P, Giegerich G. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 2004, 5:
[6] Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 1981, 9(1):
[7] Hofacker I, Fekete M, Stadler P. Secondary structure prediction for aligned RNA sequences. Journal of Molecular Biology, 2002, 319(5):
[8] Sakakibara Y et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 1994, 22(23):
[9] Eddy S R, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Research, 1994, 22(11):
[10] Dowell R, Eddy S. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 2004, 5:
[11] Dowell R, Eddy S. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 2006, 7:
[12] Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 1999, 15(6):
[13] Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research, 2003, 31(13):
[14] Do C B, Woods D A, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 2006, 22(14): e90–e98.
[15] Durbin R et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press, 1998, pp
[16] Pace N R, Thomas B C, Woese C R. Probing RNA Structure, Function, and History by Comparative Analysis. The RNA World, 2nd edition, NY: Cold Spring Harbor Laboratory Press, 1999, pp
[17] Sam G J, Alex B et al. Rfam: An RNA family database. Nucleic Acids Research, 2003, 31(1):
[18] Xiaoyong Fang et al. The detection and assessment of possible RNA secondary structure using multiple sequence alignment. In Proc. the 22nd Annual ACM Symposium on Applied Computing, Seoul, Korea, March 11–15, 2007, pp

Xiao-Yong Fang is currently a Ph.D. candidate in the School of Computer Science, National University of Defense Technology, China. His research interests include bioinformatics, computational biology, computational intelligence and high-performance computing. He has published several papers in international conferences such as the 22nd Annual ACM Symposium on Applied Computing and several papers in international journals such as Bioinformation.

Zhi-Gang Luo is currently a professor in the School of Computer Science, National University of Defense Technology, China. His research interests include bioinformatics, parallel algorithms, artificial intelligence, machine learning and high-performance computing in application domains of biomedical informatics. He has published over 40 journal or conference papers. He has served as a reviewer for over 5 international conferences. He is a program committee member of the Fifth International Workshop on Advanced Parallel Processing Technologies, Xiamen, China, 2005, and the Fifth International Conference on Grid and Cooperative Computing, Hunan, China.

Zheng-Hua Wang is a professor of computer science at the National University of Defense Technology, China. His current research interests are computer system performance evaluation, parallel processing, and bioinformatics. He received the B.Eng., M.S. and Ph.D. degrees in mechanics from the National University of Defense Technology in 1983, 1988, and 1992 respectively.

CS681: Advanced Topics in Computational Biology

CS681: Advanced Topics in Computational Biology CS681: Advanced Topics in Computational Biology Can Alkan EA224 calkan@cs.bilkent.edu.tr Week 10 Lecture 1 http://www.cs.bilkent.edu.tr/~calkan/teaching/cs681/ RNA folding Prediction of secondary structure

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri RNA Structure Prediction Secondary

More information

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable RNA STRUCTURE RNA Basics RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U wobble pairing Bases can only pair with one other base. 23 Hydrogen Bonds more stable RNA Basics transfer RNA (trna) messenger

More information

Semi-Supervised CONTRAfold for RNA Secondary Structure Prediction: A Maximum Entropy Approach

Semi-Supervised CONTRAfold for RNA Secondary Structure Prediction: A Maximum Entropy Approach Wright State University CORE Scholar Browse all Theses and Dissertations Theses and Dissertations 2011 Semi-Supervised CONTRAfold for RNA Secondary Structure Prediction: A Maximum Entropy Approach Jianping

More information

Improved TBL algorithm for learning context-free grammar

Improved TBL algorithm for learning context-free grammar Proceedings of the International Multiconference on ISSN 1896-7094 Computer Science and Information Technology, pp. 267 274 2007 PIPS Improved TBL algorithm for learning context-free grammar Marcin Jaworski

More information

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar

On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Proceedings of Machine Learning Research vol 73:153-164, 2017 AMBN 2017 On the Sizes of Decision Diagrams Representing the Set of All Parse Trees of a Context-free Grammar Kei Amii Kyoto University Kyoto

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Multiple Sequence Alignment using Profile HMM

Multiple Sequence Alignment using Profile HMM Multiple Sequence Alignment using Profile HMM. based on Chapter 5 and Section 6.5 from Biological Sequence Analysis by R. Durbin et al., 1998 Acknowledgements: M.Sc. students Beatrice Miron, Oana Răţoi,

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang

More information

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington

More information

Expectation Maximization (EM)

Expectation Maximization (EM) Expectation Maximization (EM) The EM algorithm is used to train models involving latent variables using training data in which the latent variables are not observed (unlabeled data). This is to be contrasted

More information

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 8.3.1 Simple energy minimization Maximizing the number of base pairs as described above does not lead to good structure predictions.

More information

Hidden Markov Models

Hidden Markov Models Andrea Passerini passerini@disi.unitn.it Statistical relational learning The aim Modeling temporal sequences Model signals which vary over time (e.g. speech) Two alternatives: deterministic models directly

More information

MS2a, Exercises Week 8, Model Solution

MS2a, Exercises Week 8, Model Solution Ma, Exercises Week 8, Model olution Rune Lyngsø December, 009 A Hidden Markov Model Use a. Consider a hidden Markov model emitting sequences over the alphabet {A,C,G,T}. The model has two states, 0 and,

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Sequence Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(21) Introduction Structured

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE

MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE MULTIPLE SEQUENCE ALIGNMENT FOR CONSTRUCTION OF PHYLOGENETIC TREE Manmeet Kaur 1, Navneet Kaur Bawa 2 1 M-tech research scholar (CSE Dept) ACET, Manawala,Asr 2 Associate Professor (CSE Dept) ACET, Manawala,Asr

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

RNA Secondary Structure Prediction

RNA Secondary Structure Prediction RNA Secondary Structure Prediction 1 RNA structure prediction methods Base-Pair Maximization Context-Free Grammar Parsing. Free Energy Methods Covariance Models 2 The Nussinov-Jacobson Algorithm q = 9

More information

Characterising RNA secondary structure space using information entropy

Characterising RNA secondary structure space using information entropy Characterising RNA secondary structure space using information entropy Zsuzsanna Sükösd 1,2,3, Bjarne Knudsen 4, James WJ Anderson 5, Ádám Novák 5,6, Jørgen Kjems 2,3 and Christian NS Pedersen 1,7 1 Bioinformatics

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Stephen Scott.

Stephen Scott. 1 / 21 sscott@cse.unl.edu 2 / 21 Introduction Designed to model (profile) a multiple alignment of a protein family (e.g., Fig. 5.1) Gives a probabilistic model of the proteins in the family Useful for

More information

O 3 O 4 O 5. q 3. q 4. Transition

O 3 O 4 O 5. q 3. q 4. Transition Hidden Markov Models Hidden Markov models (HMM) were developed in the early part of the 1970 s and at that time mostly applied in the area of computerized speech recognition. They are first described in

More information

Lecture 10: December 22, 2009

Lecture 10: December 22, 2009 Computational Genomics Fall Semester, 2009 Lecture 0: December 22, 2009 Lecturer: Roded Sharan Scribe: Adam Weinstock and Yaara Ben-Amram Segre 0. Context Free Grammars 0.. Introduction In lecture 6, we

More information

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010 Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition

More information

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17 RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17 Dr. Stefan Simm, 01.11.2016 simm@bio.uni-frankfurt.de RNA secondary structures a. hairpin loop b. stem c. bulge loop d. interior loop e. multi

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II)

CISC 889 Bioinformatics (Spring 2004) Hidden Markov Models (II) CISC 889 Bioinformatics (Spring 24) Hidden Markov Models (II) a. Likelihood: forward algorithm b. Decoding: Viterbi algorithm c. Model building: Baum-Welch algorithm Viterbi training Hidden Markov models

More information

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Hidden Markov Models and Their Applications in Biological Sequence Analysis Hidden Markov Models and Their Applications in Biological Sequence Analysis Byung-Jun Yoon Dept. of Electrical & Computer Engineering Texas A&M University, College Station, TX 77843-3128, USA Abstract

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

STRUCTURAL BIOINFORMATICS I. Fall 2015

STRUCTURAL BIOINFORMATICS I. Fall 2015 STRUCTURAL BIOINFORMATICS I Fall 2015 Info Course Number - Classification: Biology 5411 Class Schedule: Monday 5:30-7:50 PM, SERC Room 456 (4 th floor) Instructors: Vincenzo Carnevale - SERC, Room 704C;

More information

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i 第 4 章 : 进化树构建的概率方法 问题介绍 进化树构建方法的概率方法 部分 lid 修改自 i i f l 的 ih l i 部分 Slides 修改自 University of Basel 的 Michael Springmann 课程 CS302 Seminar Life Science Informatics 的讲义 Phylogenetic Tree branch internal node

More information

order is number of previous outputs

order is number of previous outputs Markov Models Lecture : Markov and Hidden Markov Models PSfrag Use past replacements as state. Next output depends on previous output(s): y t = f[y t, y t,...] order is number of previous outputs y t y

More information

11.3 Decoding Algorithm

11.3 Decoding Algorithm 11.3 Decoding Algorithm 393 For convenience, we have introduced π 0 and π n+1 as the fictitious initial and terminal states begin and end. This model defines the probability P(x π) for a given sequence

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder

HMM applications. Applications of HMMs. Gene finding with HMMs. Using the gene finder HMM applications Applications of HMMs Gene finding Pairwise alignment (pair HMMs) Characterizing protein families (profile HMMs) Predicting membrane proteins, and membrane protein topology Gene finding

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

Comparison of Cost Functions in Sequence Alignment. Ryan Healey Comparison of Cost Functions in Sequence Alignment Ryan Healey Use of Cost Functions Used to score and align sequences Mathematically model how sequences mutate and evolve. Evolution and mutation can be

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6)

Background: comparative genomics. Sequence similarity. Homologs. Similarity vs homology (2) Similarity vs homology. Sequence Alignment (chapter 6) Sequence lignment (chapter ) he biological problem lobal alignment Local alignment Multiple alignment Background: comparative genomics Basic question in biology: what properties are shared among organisms?

More information

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence

Page 1. References. Hidden Markov models and multiple sequence alignment. Markov chains. Probability review. Example. Markovian sequence Page Hidden Markov models and multiple sequence alignment Russ B Altman BMI 4 CS 74 Some slides borrowed from Scott C Schmidler (BMI graduate student) References Bioinformatics Classic: Krogh et al (994)

More information

Supplementary Materials for

Supplementary Materials for advances.sciencemag.org/cgi/content/full/1/8/e1500527/dc1 Supplementary Materials for A phylogenomic data-driven exploration of viral origins and evolution The PDF file includes: Arshan Nasir and Gustavo

More information

De novo prediction of structural noncoding RNAs

De novo prediction of structural noncoding RNAs 1/ 38 De novo prediction of structural noncoding RNAs Stefan Washietl 18.417 - Fall 2011 2/ 38 Outline Motivation: Biological importance of (noncoding) RNAs Algorithms to predict structural noncoding RNAs

More information

Pairwise sequence alignment and pair hidden Markov models

Pairwise sequence alignment and pair hidden Markov models Pairwise sequence alignment and pair hidden Markov models Martin C. Frith April 13, 2012 ntroduction Pairwise alignment and pair hidden Markov models (phmms) are basic textbook fare [2]. However, there

More information

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

More information

Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics

Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics Application of Associative Matrices to Recognize DNA Sequences in Bioinformatics 1. Introduction. Jorge L. Ortiz Department of Electrical and Computer Engineering College of Engineering University of Puerto

More information

Some Problems from Enzyme Families

Some Problems from Enzyme Families Some Problems from Enzyme Families Greg Butler Department of Computer Science Concordia University, Montreal www.cs.concordia.ca/~faculty/gregb gregb@cs.concordia.ca Abstract I will discuss some problems

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2011 1 HMM Lecture Notes Dannie Durand and Rose Hoberman October 11th 1 Hidden Markov Models In the last few lectures, we have focussed on three problems

More information

Applications of Hidden Markov Models

Applications of Hidden Markov Models 18.417 Introduction to Computational Molecular Biology Lecture 18: November 9, 2004 Scribe: Chris Peikert Lecturer: Ross Lippert Editor: Chris Peikert Applications of Hidden Markov Models Review of Notation

More information

Lecture 21: Spectral Learning for Graphical Models

Lecture 21: Spectral Learning for Graphical Models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 Lecture 21: Spectral Learning for Graphical Models Lecturer: Eric P. Xing Scribes: Maruan Al-Shedivat, Wei-Cheng Chang, Frederick Liu 1 Motivation

More information

Overview Multiple Sequence Alignment

Overview Multiple Sequence Alignment Overview Multiple Sequence Alignment Inge Jonassen Bioinformatics group Dept. of Informatics, UoB Inge.Jonassen@ii.uib.no Definition/examples Use of alignments The alignment problem scoring alignments

More information

Conditional Random Fields: An Introduction

Conditional Random Fields: An Introduction University of Pennsylvania ScholarlyCommons Technical Reports (CIS) Department of Computer & Information Science 2-24-2004 Conditional Random Fields: An Introduction Hanna M. Wallach University of Pennsylvania

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Evolutionary Tree Analysis. Overview

Evolutionary Tree Analysis. Overview CSI/BINF 5330 Evolutionary Tree Analysis Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Backgrounds Distance-Based Evolutionary Tree Reconstruction Character-Based

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

More information

CONTEXT-SENSITIVE HIDDEN MARKOV MODELS FOR MODELING LONG-RANGE DEPENDENCIES IN SYMBOL SEQUENCES

CONTEXT-SENSITIVE HIDDEN MARKOV MODELS FOR MODELING LONG-RANGE DEPENDENCIES IN SYMBOL SEQUENCES CONTEXT-SENSITIVE HIDDEN MARKOV MODELS FOR MODELING LONG-RANGE DEPENDENCIES IN SYMBOL SEQUENCES Byung-Jun Yoon, Student Member, IEEE, and P. P. Vaidyanathan*, Fellow, IEEE January 8, 2006 Affiliation:

More information

Hidden Markov Models Part 2: Algorithms

Hidden Markov Models Part 2: Algorithms Hidden Markov Models Part 2: Algorithms CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Hidden Markov Model An HMM consists of:

More information

Computational Genomics and Molecular Biology, Fall

Computational Genomics and Molecular Biology, Fall Computational Genomics and Molecular Biology, Fall 2014 1 HMM Lecture Notes Dannie Durand and Rose Hoberman November 6th Introduction In the last few lectures, we have focused on three problems related

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

More information

CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models

CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models Supplementary Material for CONTRAfold: RNA Secondary Structure Prediction without Physics-Based Models Chuong B Do, Daniel A Woods, and Serafim Batzoglou Stanford University, Stanford, CA 94305, USA, {chuongdo,danwoods,serafim}@csstanfordedu,

More information

Mitochondrial Genome Annotation

Mitochondrial Genome Annotation Protein Genes 1,2 1 Institute of Bioinformatics University of Leipzig 2 Department of Bioinformatics Lebanese University TBI Bled 2015 Outline Introduction Mitochondrial DNA Problem Tools Training Annotation

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Parametrized Stochastic Grammars for RNA Secondary Structure Prediction

Parametrized Stochastic Grammars for RNA Secondary Structure Prediction Parametrized Stochastic Grammars for RNA Secondary Structure Prediction Robert S. Maier Departments of Mathematics and Physics University of Arizona Tucson, AZ 85721, USA Email: rsm@math.arizona.edu Abstract

More information

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them

Sequence labeling. Taking collective a set of interrelated instances x 1,, x T and jointly labeling them HMM, MEMM and CRF 40-957 Special opics in Artificial Intelligence: Probabilistic Graphical Models Sharif University of echnology Soleymani Spring 2014 Sequence labeling aking collective a set of interrelated

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17: Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:50 5001 5 Multiple Sequence Alignment The first part of this exposition is based on the following sources, which are recommended reading:

More information

CAP 5510 Lecture 3 Protein Structures

CAP 5510 Lecture 3 Protein Structures CAP 5510 Lecture 3 Protein Structures Su-Shing Chen Bioinformatics CISE 8/19/2005 Su-Shing Chen, CISE 1 Protein Conformation 8/19/2005 Su-Shing Chen, CISE 2 Protein Conformational Structures Hydrophobicity

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline CG-islands The Fair Bet Casino Hidden Markov Model Decoding Algorithm Forward-Backward Algorithm Profile HMMs HMM Parameter Estimation Viterbi training Baum-Welch algorithm

More information

Anomaly Detection for the CERN Large Hadron Collider injection magnets

Anomaly Detection for the CERN Large Hadron Collider injection magnets Anomaly Detection for the CERN Large Hadron Collider injection magnets Armin Halilovic KU Leuven - Department of Computer Science In cooperation with CERN 2018-07-27 0 Outline 1 Context 2 Data 3 Preprocessing

More information
