Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis
Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): July 2008

Xiao-Yong Fang, Zhi-Gang Luo, and Zheng-Hua Wang
School of Computer Science, National University of Defense Technology, Changsha, China
xyfang@nudt.edu.cn; zgluo@nudt.edu.cn
Received July 2, 2007; revised May 3,

Abstract Stochastic context-free grammars (SCFGs) have been applied to predicting RNA secondary structure. The prediction of RNA secondary structure can be facilitated by incorporating comparative sequence analysis. However, most existing SCFG-based methods lack explicit phylogenic analysis of homologous RNA sequences, which is probably why these methods are not ideal in practical applications. Hence, we present a new SCFG-based method that integrates phylogenic analysis with a newly defined profile SCFG. The method can be summarized as follows: 1) we define a new profile SCFG, $M$, to depict the consensus secondary structure of a multiple RNA sequence alignment; 2) we introduce two distinct hidden Markov models, $\lambda'$ and $\lambda''$, to perform phylogenic analysis of homologous RNA sequences. Here, $\lambda'$ is for non-structural regions of the sequence and $\lambda''$ is for structural regions of the sequence; 3) we merge $\lambda'$ and $\lambda''$ into $M$ to devise a combined model for prediction of RNA secondary structure. We tested our method on data sets constructed from the Rfam database. The sensitivity and specificity of our method are higher than those of the predictions by Pfold.

Keywords RNA secondary structure, stochastic context-free grammar, phylogenic analysis

1 Introduction

In recent years, non-coding RNAs (ncRNAs) have gained increasing interest since a huge variety of functions associated with them were found [1-3].
The function of an RNA molecule is principally determined by its (secondary) structure. To date, experimental approaches constitute the most reliable methods for secondary structure determination [4]. Unfortunately, their difficulty and expense are often prohibitive, especially for high-throughput applications. For this reason, computational prediction provides an attractive alternative to empirical discovery of RNA secondary structure [5]. Most algorithms for RNA secondary structure prediction are based on Minimum Free Energy (MFE) [6,7]. However, there are several independent reasons why the accuracy of MFE structure prediction is limited in practice [5].

Recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for predicting RNA secondary structure [8-15]. The advantage of these methods is that they are more readily extended to include other sources of statistical information that constrain a structure prediction. In general, the most powerful source of information for RNA structure prediction is probably comparative sequence analysis [16], which uses evolutionary information. Hence, there is a need for automated approaches that combine evolutionary information from comparative sequence analysis with probabilistic modeling approaches using stochastic context-free grammars. However, most existing SCFG-based methods except Pfold [13,14] do not explicitly take evolutionary information of RNA sequences into account. Though Pfold has achieved better results by using its so-called phylo-SCFGs, much evolutionary information, such as deletion and insertion, is omitted to simplify the prediction algorithm. This is probably why Pfold is not ideal for some low-quality alignments, especially alignments containing many gaps in the sequences.

In this paper, we present a new combined method for secondary structure prediction by integrating phylogenic analysis with SCFGs.
Here, more information about the evolutionary process of RNA sequences is considered when computing the optimal secondary structure. We define a new profile SCFG by introducing some new variables and production rules. We perform phylogenic analysis of RNA sequences using two distinct evolutionary substitution models. Here, the theory of HMMs [15] is employed to take deletion and insertion into account. We devise a combined model for structure prediction by integrating the evolutionary models with the new profile SCFG. We tested our method on data sets of known RNA sequence alignments downloaded from the Rfam database [17]. The comparison with Pfold shows that our method can predict RNA secondary structure with higher accuracy than Pfold.

Regular Paper. Supported by the National Natural Science Foundation of China under Grant No

2 Methods

2.1 Modeling RNA Secondary Structure Using Profile SCFGs

An SCFG can be viewed as a device capable of both generating and parsing strings [15]. The SCFG described here takes a gapped sequence alignment as input and is thus called a profile SCFG. Here, we introduce the profile SCFG from the parsing point of view. The profile SCFG consists of a set of variables, some terminal and some non-terminal. Terminals correspond to columns of the input alignment and non-terminals correspond to the states of the grammar. The non-terminals are rewritten according to a set of production rules. Each production rule specifies a single non-terminal and a set of columns (if any) that it should be changed to. Successively applying the production rules of the grammar until all columns of the input alignment have been described is called parsing, which defines a parse tree.

Most literature on SCFGs assumes the grammar to be in Chomsky normal form [15] for the algorithms to be used. However, the profile SCFG presented here decomposes productions into two independent parts: non-terminal transitions and terminal emissions. The profile SCFG can thus be defined by the four-tuple $M = \{W, T, Al, E\}$. Here, $M$ is the profile SCFG; $W$ is the set of non-terminals; $T$ is the matrix of transition distributions; $Al$ is the set of terminals; and $E$ is the matrix of emission distributions. We define them as follows.

$W = \{\mathrm{start}, \mathrm{bifurcation}, \mathrm{single}, \mathrm{pair}, \mathrm{end}\}$.
$T = [t(w, w')]$, where $w, w' \in W$, and $t(w, w')$ is the transition probability from $w$ to $w'$.

$Al = \{A, C, G, U, -\}^n$, where $n$ is the number of sequences in the alignment and $-$ symbolizes the gap.

$E = [e_w]$, where $w \in W$. If $w = \mathrm{pair}$, then $e_w = e(\beta, \beta')$. Here, $\beta, \beta' \in Al$, and $(\beta, \beta')$ is a column-pair, which means two paired columns (see Fig.1(a)). If $w = \mathrm{single}$, then $e_w = e(\gamma)$. Here, $\gamma \in Al$, and $(\gamma)$ is a single column which does not pair with any column (see Fig.1(a)). The non-terminals start, bifurcation and end do not produce any column.

The production rules can be categorized into three classes: the pair rules, the single rules and the others. We define them as follows.

The pair rules: $\mathrm{pair} \to \beta\, w'\, \beta'$ (probability: $t(\mathrm{pair}, w')\, e(\beta, \beta')$).

The single rules: $\mathrm{single} \to \gamma\, w'$ (probability: $t(\mathrm{single}, w')\, e(\gamma)$), $\mathrm{single} \to w'\, \gamma$ (probability: $t(\mathrm{single}, w')\, e(\gamma)$).

The others: $\mathrm{start} \to w$ (probability: $t(\mathrm{start}, w)$, $w \in \{\mathrm{pair}, \mathrm{single}\}$), $\mathrm{bifurcation} \to \mathrm{start}\ \mathrm{start}$ (probability: 1).

Fig.1. Modeling RNA secondary structure using the defined profile SCFG. (a) RNA alignment of three sequences. (b) Consensus structure for this alignment. For simplicity, only the first sequence is shown. (c) Parsing by the profile SCFG for this structure. (d) Parse tree for this parsing.

Here, start starts a parsing, end terminates a parsing, and bifurcation is used for splitting into multiple stems and multi-branch loops. The non-terminal single produces a single column in non-structural regions of the alignment. The non-terminal pair produces a column-pair in structural regions of the alignment.

The consensus secondary structure of an RNA alignment can be modeled by the profile SCFG defined above. In Fig.1, Fig.1(a) is a three-way alignment; Fig.1(b) is the consensus secondary structure, which contains two stems as shown in Fig.1(a); Fig.1(c) is the parsing of the structure shown in Fig.1(b); and Fig.1(d) is the parse tree for the parsing. Note that there may be more than one consensus structure for an RNA alignment, and each of them corresponds to one parsing by the profile SCFG. For simplicity, only one of the possible structures is presented here. The SCFG presented here differs from other SCFGs in one major respect: when a production rule is applied, it generates not a single base or base-pair but a single column or column-pair.

By assigning a probability to each production rule, each parse tree can be assigned an overall probability, which is the product of the probabilities of all production rules used in the tree. Therefore, the RNA secondary structure can be evaluated by the parse tree given by the SCFG. The probability of an RNA secondary structure can be computed by:

$$P(\sigma \mid M) = P(\mathit{Tree} \mid M) = \prod_{i=1}^{l_1} t_i(\mathrm{pair}, w)\, e(\beta, \beta') \prod_{j=1}^{l_2} t_j(\mathrm{single}, w')\, e(\gamma) \prod_{k=1}^{l_3} t_k, \quad w, w' \in W. \tag{1}$$

Here, $\mathit{Tree}$ is the parse tree that uses $l_1$ pair rules, $l_2$ single rules and $l_3$ others; $t_k$ is the probability of a rule from the others; $\sigma$ is the secondary structure which can be parsed by $\mathit{Tree}$ given the profile SCFG, $M$.

2.2 Phylogenic Analysis of Homologous RNA Sequences Using HMMs

Evolutionary substitution models have long been used for phylogenic inference and for the study of molecular evolution. Here, we first present two novel evolutionary models using the theory of HMMs, and then use them to perform phylogenic analysis. Our method takes as input a phylogenic tree relating the sequences, as shown in Fig.2(b), and assumes that all the sequences in the alignment evolve independently.

Suppose $D$ denotes an RNA alignment containing $n$ sequences of length $N$; $\alpha_i$ denotes the $i$-th column of $D$; $\alpha^j$ denotes the $j$-th row (i.e., the $j$-th sequence) of $D$; $\alpha_i^j$ denotes the base at site $(i, j)$ of $D$. Here, $1 \le i \le N$ and $1 \le j \le n$.
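Returning briefly to Eq. (1) of Subsection 2.1, the product over rule applications can be illustrated with a small sketch. The function name, the tuple encoding of rule applications, the three-sequence columns and all probability values below are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
import math

def parse_tree_probability(rules, t, e_pair, e_single):
    """Multiply the probabilities of all rule applications in a parse tree, as in Eq. (1)."""
    factors = []
    for rule in rules:
        kind = rule[0]
        if kind == "pair":            # pair -> beta w' beta'
            _, nxt, beta, beta_p = rule
            factors.append(t[("pair", nxt)] * e_pair[(beta, beta_p)])
        elif kind == "single":        # single -> gamma w'  or  single -> w' gamma
            _, nxt, gamma = rule
            factors.append(t[("single", nxt)] * e_single[gamma])
        elif kind == "start":         # start -> w
            _, nxt = rule
            factors.append(t[("start", nxt)])
        elif kind == "bifurcation":   # bifurcation -> start start, probability 1
            factors.append(1.0)
        else:
            raise ValueError(f"unknown rule kind: {kind}")
    return math.prod(factors)

# Made-up parameters for a three-sequence alignment (each column is a 3-character string).
toy_t = {("start", "pair"): 0.6, ("pair", "single"): 0.3, ("single", "end"): 0.5}
toy_e_pair = {("GGG", "CCC"): 0.2}
toy_e_single = {"AAA": 0.1}

# One pair rule (l1 = 1), one single rule (l2 = 1) and one "other" rule (l3 = 1).
toy_tree = [
    ("start", "pair"),
    ("pair", "single", "GGG", "CCC"),
    ("single", "end", "AAA"),
]
toy_probability = parse_tree_probability(toy_tree, toy_t, toy_e_pair, toy_e_single)
```

With these toy numbers the product is 0.6 × (0.3 × 0.2) × (0.5 × 0.1). For long alignments one would normally sum logarithms instead of multiplying raw probabilities, to avoid numerical underflow.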
Suppose $Tr$ is the phylogenic tree inferred from the $n$ sequences and $R$ is the set of edges on the tree. Then, $Tr$ is a tree with $n$ leaves, with sequence $j$ at leaf $j$. In Fig.2, Fig.2(a) is an RNA alignment of three sequences, and Fig.2(b) is one of the possible phylogenic trees inferred from the sequences. Suppose the nodes on $Tr$ are numbered from 1 to $2n-1$ and the last is the root node. Then the probability of $\alpha_i$ given the tree $Tr$, $P(\alpha_i \mid Tr, R)$, can be evaluated by:

$$P(\alpha_i \mid Tr, R) = P(\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^n \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} q_{2n-1} \prod_{k=1}^{2n-2} P(a_k \mid a_{F(k)}, r_k). \tag{2}$$

Fig.2. (a) RNA alignment of three sequences. (b) Phylogenic tree with three leaves, with sequence $j$ at leaf $j$ ($1 \le j \le 3$). For simplicity, only the evolutionary substitution process for one of the single columns (labeled by the box in (a)) is shown.

Here, $R = \{r_1, r_2, \ldots, r_{2n-2}\}$; $q_{2n-1}$ is the initial probability of the base at the root node; $F(k)$ is the immediate ancestor node of $k$ ($1 \le k \le 2n-2$); $a_k$ ($1 \le k \le 2n-1$) denotes the base at node $k$; $P(a_k \mid a_{F(k)}, r_k)$ denotes the probability of $a_k$ arising from $a_{F(k)}$ over $r_k$ ($1 \le k \le 2n-2$). In particular, $a_k = \alpha_i^j$ with $j = k$ when $1 \le k \le n$. The sum is over all possible assignments of bases $a_k$ to non-leaf nodes $k$ ($n+1 \le k \le 2n-1$).

However, Eq. (2) is only suitable for ungapped alignments because it does not take deletions and insertions into account. To solve this problem, we model the evolutionary substitution process using two distinct HMMs.

We define the evolutionary model for a single column with the five-tuple $\lambda' = \{S', V', \pi', T', E'\}$. Here, $\lambda'$ is the HMM; $S'$ is the set of states; $V'$ is the set of observed characters; $\pi'$ is the initial probability vector of the states; $T'$ is the matrix of transition probabilities; $E'$ is the matrix of emission probabilities. We define them as follows.

$S' = \{\mathrm{match}, \mathrm{delete}, \mathrm{insert}\}$, where match symbolizes the matching state, delete symbolizes the deletion state and insert symbolizes the insertion state.

$V' = \{A, C, G, U, -\}$.

$\pi' = \{\pi'_{\mathrm{match}}, \pi'_{\mathrm{delete}}, \pi'_{\mathrm{insert}}\}$, where $\pi'_x$ is the initial probability of $x$ and $x \in S'$.
$T' = [t'(x, y)]$, where $x, y \in S'$, and $t'(x, y)$ is the transition probability from $x$ to $y$.

$E'$ is the matrix of emission probabilities. Here, we use the mutation probabilities of single bases as the emission probabilities. Let $e'_{x,y}(b, b')$ denote the probability of $b$ evolving to $b'$ when $x$ changes to $y$. Here $x, y \in S'$, and $b, b' \in V'$. The states match, delete and insert cover the following three cases:

$b \ne -$ and $b' \ne -$ implies $x = \mathrm{match}$.
$b \ne -$ and $b' = -$ implies $x = \mathrm{delete}$.
$b = -$ and $b' \ne -$ implies $x = \mathrm{insert}$.

For the leaf node $j$ ($1 \le j \le n$), let $A_j$ denote the set of ancestor nodes of $j$ and $R_j$ denote the set of corresponding edges. Suppose the node $j$ has $m(j)$ ancestor nodes. Thus $A_j = \{2n-1, A(1), A(2), \ldots, A(m(j)-1)\}$. Here, $2n-1$ is the root node and $A(k)$ is the immediate ancestor node of $A(k+1)$ for $1 \le k \le m(j)-2$. For the leaf node $j$ ($1 \le j \le n$), we set $f'$ as the recursion variable and initialize it with:

$$f'_1(y) = q_{2n-1}\, \pi'_{\mathrm{match}}\, t'(\mathrm{match}, y)\, e'_{\mathrm{match},y}(a_{2n-1}, a_{A(1)}), \quad y \in S'. \tag{3}$$

We define $\pi'_{\mathrm{match}} = 1$ because the root node is always in the matching state. The recurrence is:

$$f'_k(y) = \sum_{x \in S'} f'_{k-1}(x)\, t'(x, y)\, e'_{x,y}(a_{A(k-1)}, a_{A(k)}), \quad x, y \in S';\ 2 \le k \le m(j)-1. \tag{4}$$

The probability of $\alpha_i^j$ given $Tr$ and $R_j$ is thus computed by:

$$P(\alpha_i^j \mid Tr, R_j) = \sum_{y \in S'} \sum_{x \in S'} f'_{m(j)-1}(x)\, t'(x, y)\, e'_{x,y}(a_{A(m(j)-1)}, \alpha_i^j), \quad 1 \le i \le N;\ 1 \le j \le n. \tag{5}$$

Eq. (2) in the case of single columns (i.e., non-structural regions) is thus revised to:

$$P(\alpha_i \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} \prod_{j=1}^{n} P(\alpha_i^j \mid Tr, R_j), \quad 1 \le i \le N. \tag{6}$$

Similarly, we define the evolutionary model for a column-pair with another five-tuple $\lambda'' = \{S'', V'', \pi'', T'', E''\}$. We define them as follows.

$S'' = \{\mathrm{match}, \mathrm{delete}, \mathrm{insert}\}$.

$V'' = \{A, C, G, U, -\} \times \{A, C, G, U, -\}$.

$\pi'' = \{\pi''_{\mathrm{match}}, \pi''_{\mathrm{delete}}, \pi''_{\mathrm{insert}}\}$, where $\pi''_x$ is the initial probability of $x$ and $x \in S''$.

$T'' = [t''(x, y)]$, where $x, y \in S''$ and $t''(x, y)$ is the transition probability from $x$ to $y$.
$E''$ is the matrix of emission probabilities. Here, we use the mutation probabilities of base-pairs as the emission probabilities. Let $e''_{x,y}(b_1 b_1', b_2 b_2')$ denote the probability of $b_1 b_1'$ evolving to $b_2 b_2'$ when $x$ changes to $y$. Here $x, y \in S''$ and $b_1 b_1', b_2 b_2' \in V''$. The states match, delete and insert cover the following three cases.

If none of the characters in $b_1 b_1'$ changes to the gap character $-$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{match}$.
If at least one of the characters in $b_1 b_1'$ changes to the gap character $-$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{delete}$.
If at least one gap character $-$ in $b_1 b_1'$ changes to one of $\{A, C, G, U\}$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{insert}$.

Let $\alpha_i \alpha_{i'}$ denote one of the column-pairs of $D$ and $\alpha_i^j \alpha_{i'}^j$ denote the base-pair from $\alpha_i \alpha_{i'}$ at the $j$-th sequence (i.e., the leaf node $j$) of $D$. Here, $1 \le j \le n$ and $1 \le i, i' \le N$. Also, let $a_k a_k'$ ($1 \le k \le 2n-1$) denote the base-pair at node $k$. For the leaf node $j$ ($1 \le j \le n$), we set $f''$ as the recursion variable and initialize it with:

$$f''_1(y) = q_{2n-1}\, \pi''_{\mathrm{match}}\, t''(\mathrm{match}, y)\, e''_{\mathrm{match},y}(a_{2n-1} a'_{2n-1},\ a_{A(1)} a'_{A(1)}), \quad y \in S''. \tag{7}$$

Here, $q_{2n-1}$ is the initial probability of the base-pair at the root node, and again $\pi''_{\mathrm{match}} = 1$. The recurrence is:

$$f''_k(y) = \sum_{x \in S''} f''_{k-1}(x)\, t''(x, y)\, e''_{x,y}(a_{A(k-1)} a'_{A(k-1)},\ a_{A(k)} a'_{A(k)}), \quad x, y \in S'';\ 2 \le k \le m(j)-1. \tag{8}$$

The probability of $\alpha_i^j \alpha_{i'}^j$ given $Tr$ and $R_j$ is thus computed by:

$$P(\alpha_i^j \alpha_{i'}^j \mid Tr, R_j) = \sum_{y \in S''} \sum_{x \in S''} f''_{m(j)-1}(x)\, t''(x, y)\, e''_{x,y}(a_{A(m(j)-1)} a'_{A(m(j)-1)},\ \alpha_i^j \alpha_{i'}^j), \quad 1 \le i, i' \le N;\ 1 \le j \le n. \tag{9}$$

Eq. (2) in the case of column-pairs (i.e., structural regions) is thus revised to:

$$P(\alpha_i \alpha_{i'} \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} \prod_{j=1}^{n} P(\alpha_i^j \alpha_{i'}^j \mid Tr, R_j), \quad 1 \le i, i' \len N. \tag{10}$$
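The single-column recursion of Eqs. (3)-(5) can be sketched as follows for one leaf, given one fixed assignment of bases to its ancestors. This is only an illustrative sketch: the function name, the list encoding of the ancestor path, and the toy transition and emission values are assumptions introduced here, not parameters from the paper. The column-pair recursion of Eqs. (7)-(9) has the same shape, with base-pairs in place of bases.

```python
import math

STATES = ("match", "delete", "insert")

def leaf_probability(ancestor_bases, leaf_base, q_root, t, e):
    """Eqs. (3)-(5) for one leaf.

    ancestor_bases = [a_root, a_A(1), ..., a_A(m-1)] is one assignment of bases
    to the m ancestors of the leaf, ordered from the root downwards.
    """
    if len(ancestor_bases) < 2:
        raise ValueError("Eqs. (3)-(5) need the root plus at least one more ancestor")
    root, first = ancestor_bases[0], ancestor_bases[1]
    # Eq. (3): initialisation with pi_match = 1.
    f = {y: q_root[root] * t[("match", y)] * e("match", y, root, first) for y in STATES}
    # Eq. (4): recurrence along the path for k = 2, ..., m-1.
    for k in range(2, len(ancestor_bases)):
        prev, cur = ancestor_bases[k - 1], ancestor_bases[k]
        f = {y: sum(f[x] * t[(x, y)] * e(x, y, prev, cur) for x in STATES) for y in STATES}
    # Eq. (5): final step from the last ancestor to the observed leaf base.
    last = ancestor_bases[-1]
    return sum(f[x] * t[(x, y)] * e(x, y, last, leaf_base) for y in STATES for x in STATES)

def toy_emission(x, y, b, b2):
    """Hypothetical mutation probabilities, indexed here by the target state y only."""
    if y == "delete":
        return 1.0 if (b != "-" and b2 == "-") else 0.0
    if y == "insert":
        return 0.25 if (b == "-" and b2 != "-") else 0.0
    if b == "-" or b2 == "-":      # match requires two real bases
        return 0.0
    return 0.7 if b == b2 else 0.1

# Made-up transition probabilities: the same row for every source state.
toy_t = {(x, y): p for x in STATES
         for y, p in (("match", 0.8), ("delete", 0.1), ("insert", 0.1))}
toy_q = {base: 0.25 for base in "ACGU"}

p_two_ancestors = leaf_probability(["A", "A"], "G", toy_q, toy_t, toy_emission)
p_three_ancestors = leaf_probability(["A", "A", "A"], "A", toy_q, toy_t, toy_emission)
```

To obtain Eq. (6), the per-leaf values would be multiplied over all leaves and summed over every assignment of bases to the internal nodes; that outer loop is omitted here.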
2.3 Secondary Structure Prediction Using Combined Model

We construct a combined model by revising $M$ using $\lambda'$ and $\lambda''$. In more detail, we revise Eq. (1) to:

$$P(\sigma \mid M, \lambda', \lambda'') = P(\mathit{Tree} \mid M, \lambda', \lambda'') = \prod_{i=1}^{l_1} t_i(\mathrm{pair}, w)\, e(\beta, \beta')\, P(\beta\beta' \mid Tr, R) \prod_{j=1}^{l_2} t_j(\mathrm{single}, w')\, e(\gamma)\, P(\gamma \mid Tr, R) \prod_{k=1}^{l_3} t_k, \quad w, w' \in W. \tag{11}$$

Here, $P(\beta\beta' \mid Tr, R)$ is computed by Eq. (10) and $P(\gamma \mid Tr, R)$ is computed by Eq. (6). Let $\sigma_{\max}$ denote the optimal secondary structure for the alignment $D$; then:

$$\sigma_{\max} = \arg\max_{\sigma} P(\sigma \mid M, \lambda', \lambda''). \tag{12}$$

$\sigma_{\max}$ is found using the CYK algorithm [15] on the profile SCFG $M$.

3 Results

3.1 Parameters Estimation

The free parameters of the combined model are the parameters of $M$, $\lambda'$ and $\lambda''$. To estimate the parameters, we construct the training data sets using the known consensus secondary structures of ncRNA families taken from the Rfam database. Here, we build the training data set together with the testing data set and report them in Table 1. Specifically, Rfam contains seed alignments of multiple RNA sequences, and a consensus secondary structure for each alignment, either taken from a previously published study in the literature or predicted using automated covariance-based methods. To establish gold-standard data for training and testing, we first removed all seed alignments with only predicted secondary structures, retaining the 150 families with secondary structures from the literature. The end result was a set of 150 independent examples, each taken from a different RNA family.

Table 1. Distribution of the Training and Testing Data Sets
Id (%) | # Family | # Sequence | # Three-way | # Four-way | # Five-way
Note: Id: percentage identity; # Family: number of RNA families; # Sequence: number of RNA sequences; # Three-way: number of three-way alignments; # Four-way: number of four-way alignments; # Five-way: number of five-way alignments.
For each selected seed alignment, we randomly extract three, four and five different sequences from the ncRNA family. For example, from a family of 5 sequences we can obtain 10 (i.e., $C_5^3$) three-way alignments, 5 (i.e., $C_5^4$) four-way alignments, and 1 (i.e., $C_5^5$) five-way alignment in total. In this way, we construct one set of three-way alignments, one set of four-way alignments and one set of five-way alignments. These data sets are used for training the models $M$, $\lambda'$ and $\lambda''$. The sequence identity of the alignments ranges from 40% to 99%. The distribution of the training and testing data sets is shown in Table 1. For the data sets shown in Table 1, the alignments from 100 families are selected to build the training data set and the alignments from the other 50 families are used to build the testing data set.

As for the parameters of $M$, the set of transition distributions $T$ and the set of emission distributions $E$ are estimated. For estimating the probabilities of $T$, the inside-outside algorithm [15] (an expectation maximization procedure) is performed on the training set of secondary structures. In more detail, the number of times each rule is used is counted from the training set, and the probability of each rule is then computed by dividing the number of times it is used by the total number of times all rules are used. For estimating the $e(\gamma)$ probabilities of $E$, the single-column frequencies are estimated from counts of the columns in the non-structural regions of the alignments in the training set. Thus, the overall single-column frequencies are determined, and hence the probability of each kind of single column is also determined. The $e(\beta, \beta')$ probabilities of $E$ are estimated in the same way as $e(\gamma)$. The only difference is that column-pairs, rather than single columns, are counted from the structural regions of the alignments.

As for the parameters of $\lambda'$, the set of transition distributions $T'$ and the set of emission distributions $E'$ are estimated.
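Two of the steps described above, enumerating sub-alignments of a family and turning usage counts into probabilities, can be sketched in a few lines. The helper names, sequence identifiers and counts are made up for illustration, and the sketch covers only the simple count-and-normalise step, not the inside-outside procedure.

```python
from collections import Counter
from itertools import combinations

def sub_alignments(sequence_ids, k):
    """All ways of choosing k sequences from a family (the k-way alignments)."""
    return list(combinations(sequence_ids, k))

def relative_frequencies(counts):
    """Divide each count by the total count, as in the rule-frequency estimate."""
    total = sum(counts.values())
    return {key: n / total for key, n in counts.items()}

# A hypothetical family of 5 sequences.
family = ["s1", "s2", "s3", "s4", "s5"]
sizes = {k: len(sub_alignments(family, k)) for k in (3, 4, 5)}

# Hypothetical rule usage counts collected from training parse trees.
rule_counts = Counter({"pair": 6, "single": 3, "start": 1})
rule_probs = relative_frequencies(rule_counts)
```

For the family of 5 sequences, `sizes` reproduces the 10, 5 and 1 alignments mentioned in the text. The same `relative_frequencies` helper could be applied to single-column or column-pair counts.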
For estimating the probabilities of $T'$, the forward-backward algorithm [15] is performed on the evolutionary process of each sequence in the alignments of the training set. For the probabilities of $E'$, recall that we use the mutation probabilities of bases as the emission probabilities. For estimating mutation rates, a number of sequences from the training set are paired. The single-base positions in these sequence pairs are examined and all differences between the sequences are counted. For example, if a given position has base $b$ in one sequence and base $b'$ in the other, both the counter for $b$ changing to $b'$ and the counter for $b'$ changing to $b$ are incremented.

As for the parameters of $\lambda''$, the set of transition distributions $T''$ is estimated as for $T'$, and the set of emission distributions $E''$ is estimated as for $E'$. The only difference is that base-pairs, rather than single bases, are examined and counted from the sequence pairs mentioned above.

3.2 Tests for Multiple Alignments of RNA Sequences

As mentioned in Subsection 3.1, the testing data set is also constructed from the Rfam database. We first choose the 50 ncRNA families in Table 1 that are not included in the training set, and then build sets of alignments as we do for the training set. The sequence identity of the alignments again ranges from 40% to 99%. The testing data sets are again composed of one set of three-way alignments, one set of four-way alignments, and one set of five-way alignments. These alignments are also used as the input for Pfold.

To test our method and compare it with Pfold, we evaluate the predictions of each method using the sensitivity and specificity of predicted base-pairs. Specifically, we compute the sensitivity as the number of true positives divided by the sum of true positives and false negatives, and the specificity as the number of true positives divided by the sum of true positives and false positives.

Table 2. Sensitivity and Specificity on Data Set of Three-Way Alignments
Id (%) | Se (%) | Sp (%) | Se Pfold (%) | Sp Pfold (%)
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

The results of our method are compared with those of Pfold, and are reported in Tables 2-4. In these tables, the second and third columns are the results of our method, followed by the results of Pfold.
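The two accuracy measures defined above can be computed directly from sets of base-pairs. The following is a minimal sketch; the function name and the example pairs are hypothetical. Note that specificity as defined in this paper, TP / (TP + FP), is the quantity often called positive predictive value elsewhere.

```python
def sensitivity_specificity(predicted_pairs, true_pairs):
    """Sensitivity = TP / (TP + FN); specificity = TP / (TP + FP), as defined in the text."""
    predicted, truth = set(predicted_pairs), set(true_pairs)
    tp = len(predicted & truth)    # pairs that are both predicted and true
    fn = len(truth - predicted)    # true pairs that were missed
    fp = len(predicted - truth)    # predicted pairs that are not true
    se = tp / (tp + fn) if tp + fn else 0.0
    sp = tp / (tp + fp) if tp + fp else 0.0
    return se, sp

# Base-pairs are given as (i, i') column indices; the values are made up.
example_true = {(1, 20), (2, 19), (3, 18), (4, 17)}
example_predicted = {(1, 20), (2, 19), (3, 18), (5, 16), (6, 15)}
example_se, example_sp = sensitivity_specificity(example_predicted, example_true)
```

In this example three of the four true pairs are recovered and two predicted pairs are wrong, so the sensitivity is 3/4 and the specificity is 3/5.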
As shown in the tables, our method exhibits both higher sensitivity and higher specificity than Pfold, especially for alignments with lower sequence identity. One significant finding is that our method exhibits much higher accuracy than Pfold when many gaps are introduced into the input alignment. Another valuable finding is that our method exhibits higher accuracy when more sequences are included. The best accuracy of our method (sensitivity 83.55%, specificity 81.68%) is achieved in tests on five-way alignments with 50-60% sequence identity.

Table 3. Sensitivity and Specificity on Data Set of Four-Way Alignments
Id (%) | Se (%) | Sp (%) | Se Pfold (%) | Sp Pfold (%)
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

Table 4. Sensitivity and Specificity on Data Set of Five-Way Alignments
Id (%) | Se (%) | Sp (%) | Se Pfold (%) | Sp Pfold (%)
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

3.3 Comparison with the Model Without delete and insert

Here, we compare the model presented in Subsection 2.2 with a version of the model in which the delete and insert states are disabled. For the model without delete and insert, we still use Eq. (2) to compute the probability of a single column from non-structural regions of the alignment, and we use the following equation to compute the probability of a column-pair from structural regions of the alignment:

$$P(\alpha_i \alpha_{i'} \mid Tr, R) = P(\alpha_i^1 \alpha_{i'}^1, \alpha_i^2 \alpha_{i'}^2, \ldots, \alpha_i^n \alpha_{i'}^n \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} q_{2n-1} \prod_{k=1}^{2n-2} P(a_k a_k' \mid a_{F(k)} a'_{F(k)}, r_k). \tag{13}$$
Here, $\alpha_i \alpha_{i'}$ still denotes one of the column-pairs of $D$, and $a_k a_k'$ ($1 \le k \le 2n-1$) denotes the base-pair at node $k$. $P(a_k a_k' \mid a_{F(k)} a'_{F(k)}, r_k)$ denotes the probability of $a_k a_k'$ arising from $a_{F(k)} a'_{F(k)}$ over $r_k$ ($1 \le k \le 2n-2$). We still use Eqs. (11) and (12) to compute the optimal secondary structure for the alignment $D$, but $P(\beta\beta' \mid Tr, R)$ in Eq. (11) is computed by Eq. (13) and $P(\gamma \mid Tr, R)$ is computed by Eq. (2).

We selected some five-way alignments containing more gaps from the testing data sets built in Subsection 3.2 and tested the model without delete and insert on them. The results are compared with those of the full model described in Subsection 2.3, and are reported in Table 5. The second and third columns in the table are the results for the full model, followed by the results for the model without delete and insert.

Table 5. Testing for the Full Model and the Model Without delete and insert on Five-Way Alignments
Id (%) | Se Full (%) | Sp Full (%) | Se (%) | Sp (%)
Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

4 Discussion

In this paper, we devise, implement and evaluate a new SCFG-based method for predicting RNA secondary structure by introducing a more detailed phylogenic analysis of homologous sequences. Our method improves the prediction by modeling RNA secondary structures using the newly defined profile SCFG and by performing phylogenic analysis of RNA sequences using more evolutionary information, including deletion and insertion. The new profile SCFG presented here distinguishes our method from other methods. The phylogenic analysis of RNA sequences using two distinct HMMs makes our method robust. The fact that our method takes a gapped RNA alignment as input makes it well suited for practical application.

Despite the limited amount of data, we have shown in the experiments that our method can predict RNA secondary structure with better performance than Pfold.
Our method exhibits higher accuracy than Pfold, especially for alignments with more sequences, more gaps, and lower sequence identity but higher structure identity. This is the expected behavior, because our method takes more information about the evolutionary process into account, while Pfold uses a simplified evolutionary model to analyze the sequences in the alignment. On the other hand, both methods exhibit similar accuracy for alignments with much higher sequence identity (> 90%). This is because little evolutionary information can be obtained from these alignments. As a result, our method loses some of its advantage relative to Pfold. In conclusion, our method can predict RNA secondary structure with better performance than Pfold. As for the comparison of the full model with the model without delete and insert (see Table 5), we conclude that the states delete and insert can help improve the prediction of RNA secondary structure in the case of more sequences and more gaps. More benchmarks should be performed to reach a more precise conclusion.

There are some differences between the method presented here and Pfold. First, two distinct HMMs, $\lambda'$ and $\lambda''$, are proposed for performing phylogenic analysis of single columns and column-pairs of the RNA sequence alignment. Here, both $\lambda'$ and $\lambda''$ include the states match, delete and insert. Pfold, in contrast, only has the state match (from the HMM point of view) because it treats the gap as a fifth character that is not in {A, C, G, U}. Hence, much evolutionary information, such as deletion and insertion, is omitted by Pfold. Second, the HMMs presented here separately model the evolutionary substitution processes of single columns and column-pairs, whereas Pfold does not take into account the difference between single columns and column-pairs. Third, we define and train a new profile SCFG to model RNA secondary structure. In particular, we introduce the concepts of single column and column-pair.
Finally, we merge the new SCFG and the two HMMs into a combined model to predict RNA secondary structure. The method presented here can thus be regarded as an improvement of Pfold.

There are several ways in which our method could be improved. First, the fact that our method takes a fixed alignment as input makes it vulnerable to alignment errors. Unfortunately, it is difficult to construct an accurate alignment of RNA sequences without knowing the secondary structures. One way to solve this problem is to consider some structural information when aligning RNA sequences. We have presented a fast and practical method for detecting and assessing secondary structure in RNA alignments in [18]. In future work, we will apply it to constructing high-quality alignments, which can further be used to improve the prediction of RNA secondary structure. Second, our method takes a phylogenic tree relating the sequences as input. If the tree is not given, it must be estimated from the model. Estimating the tree topology can be done by an exhaustive search, a branch and bound method or a heuristic method. The choice will be highly dependent on the number of sequences in the alignment, considering how fast the number of trees grows with the number of sequences. Third, the time complexity of the method may be a concern, especially for modeling the secondary structure using the new profile SCFG. To address this, new parallel algorithms for modeling and predicting RNA secondary structures could be devised, but further study is needed to maintain the accuracy of the method. Finally, we hope that our method will find use in studies in both biology and bioinformatics.

References

[1] Storz G. An expanding universe of noncoding RNAs. Science, 2002, 296(5571).
[2] Eddy S R. Non-coding RNA genes and modern RNA world. Nat. Rev. Genet., 2001, 2(12).
[3] Huttenhofer A, Schattner P, Polacek N. Non-coding RNAs: Hope or hype? TRENDS in Genetics, 2005, 21(5).
[4] Furtig B et al. NMR spectroscopy of RNA. Chembiochem, 2003, 4(10).
[5] Gardner P P, Giegerich G. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 2004, 5.
[6] Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 1981, 9(1).
[7] Hofacker I, Fekete M, Stadler P. Secondary structure prediction for aligned RNA sequences. Journal of Molecular Biology, 2002, 319(5).
[8] Sakakibara Y et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 1994, 22(23).
[9] Eddy S R, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Research, 1994, 22(11).
[10] Dowell R, Eddy S. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction. BMC Bioinformatics, 2004, 5.
[11] Dowell R, Eddy S. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 2006, 7.
[12] Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 1999, 15(6).
[13] Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research, 2003, 31(13).
[14] Do C B, Woods D A, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 2006, 22(14): e90-e98.
[15] Durbin R et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press, 1998.
[16] Pace N R, Thomas B C, Woese C R. Probing RNA Structure, Function, and History by Comparative Analysis. The RNA World, 2nd edition, NY: Cold Spring Harbor Laboratory Press, 1999.
[17] Sam G J, Alex B et al. Rfam: An RNA family database. Nucleic Acids Research, 2003, 31(1).
[18] Xiaoyong Fang et al. The detection and assessment of possible RNA secondary structure using multiple sequence alignment. In Proc. the 22nd Annual ACM Symposium on Applied Computing, Seoul, Korea, March 11-15, 2007.

Xiao-Yong Fang is currently a Ph.D. candidate in the School of Computer Science, National University of Defense Technology, China. His research interests include bioinformatics, computational biology, computational intelligence and high-performance computing. He has published several papers in international conferences such as the 22nd Annual ACM Symposium on Applied Computing and several papers in international journals such as Bioinformation.

Zhi-Gang Luo is currently a professor in the School of Computer Science, National University of Defense Technology, China. His research interests include bioinformatics, parallel algorithms, artificial intelligence, machine learning and high-performance computing in application domains of biomedical informatics. He has published over 40 journal or conference papers. He has served as a reviewer for over 5 international conferences. He is a program committee member of the Fifth International Workshop on Advanced Parallel Processing Technologies, Xiamen, China, 2005, and the Fifth International Conference on Grid and Cooperative Computing, Hunan, China.

Zheng-Hua Wang is a professor of computer science at the National University of Defense Technology, China. His current research interests are computer system performance evaluation, parallel processing, and bioinformatics. He received the B.Eng., M.S. and Ph.D. degrees in mechanics from the National University of Defense Technology in 1983, 1988, and 1992 respectively.