Predicting RNA Secondary Structure Using Profile Stochastic Context-Free Grammars and Phylogenic Analysis

Fang XY, Luo ZG, Wang ZH. Predicting RNA secondary structure using profile stochastic context-free grammars and phylogenic analysis. JOURNAL OF COMPUTER SCIENCE AND TECHNOLOGY 23(4): 582-589 July 2008

Xiao-Yong Fang, Zhi-Gang Luo, and Zheng-Hua Wang
School of Computer Science, National University of Defense Technology, Changsha 410073, China
E-mail: xyfang@nudt.edu.cn; zgluo@nudt.edu.cn
Received July 2, 2007; revised May 3, 2008.

Abstract
Stochastic context-free grammars (SCFGs) have been applied to predicting RNA secondary structure. The prediction can be improved by incorporating comparative sequence analysis. However, most existing SCFG-based methods lack explicit phylogenic analysis of homologous RNA sequences, which is probably why they fall short in practical applications. We therefore present a new SCFG-based method that integrates phylogenic analysis with a newly defined profile SCFG. The method can be summarized as follows: 1) we define a new profile SCFG, $M$, to describe the consensus secondary structure of a multiple RNA sequence alignment; 2) we introduce two distinct hidden Markov models, $\lambda'$ and $\lambda''$, to perform phylogenic analysis of homologous RNA sequences, where $\lambda'$ is for non-structural regions of the sequence and $\lambda''$ is for structural regions of the sequence; 3) we merge $\lambda'$ and $\lambda''$ into $M$ to obtain a combined model for predicting RNA secondary structure. We tested our method on data sets constructed from the Rfam database. Both the sensitivity and the specificity of its predictions are higher than those of Pfold.

Keywords: RNA secondary structure, stochastic context-free grammar, phylogenic analysis

1 Introduction

In recent years, non-coding RNAs (ncRNAs) have gained increasing interest since a huge variety of functions associated with them were found [1-3].
The function of an RNA molecule is principally determined by its (secondary) structure. To date, experimental approaches remain the most reliable methods for secondary structure determination [4]. Unfortunately, their difficulty and expense are often prohibitive, especially for high-throughput applications. For this reason, computational prediction provides an attractive alternative to empirical discovery of RNA secondary structure [5]. Most algorithms for RNA secondary structure prediction are based on Minimum Free Energy (MFE) [6,7]. However, there are several independent reasons why the accuracy of MFE structure prediction is limited in practice [5]. Recently, stochastic context-free grammars (SCFGs) have emerged as an alternative probabilistic methodology for predicting RNA secondary structure [8-15]. The advantage of such methods is that they are more readily extended to include other sources of statistical information that constrain a structure prediction. In general, the most powerful source of information for RNA structure prediction is probably comparative sequence analysis [16], which uses evolutionary information. Hence, there is a need for automated approaches that combine evolutionary information from comparative sequence analysis with probabilistic modeling based on stochastic context-free grammars. However, most existing SCFG-based methods, with the exception of Pfold [13,14], do not explicitly take the evolutionary information of RNA sequences into account. Although Pfold has achieved better results by using its so-called phylo-SCFGs, it omits much evolutionary information, such as deletion and insertion, in order to simplify the prediction algorithm. This is probably why Pfold is not ideal for some low-quality alignments, especially alignments containing many gaps. In this paper, we present a new combined method for secondary structure prediction that integrates phylogenic analysis with SCFGs.
Here, more information about the evolutionary process of RNA sequences is considered when computing the optimal secondary structure. We define a new profile SCFG by introducing some new variables and production rules. We perform phylogenic analysis of RNA sequences using two distinct evolutionary substitution models. Here, the theory of HMMs [15] is employed to take deletion and insertion into account. We devise a combined model for structure prediction by integrating the evolutionary models with the new profile SCFG. We tested our method on data sets of known RNA sequence alignments downloaded from the Rfam database [17]. The comparison shows that our method predicts RNA secondary structure with higher accuracy than Pfold.

Regular Paper. Supported by the National Natural Science Foundation of China under Grant No. 60673018.

2 Methods

2.1 Modeling RNA Secondary Structure Using Profile SCFGs

An SCFG can be viewed as a device capable of both generating and parsing strings [15]. The SCFG described here takes a gapped sequence alignment as input and is thus called a profile SCFG. Here, we introduce the profile SCFG from the parsing point of view. The profile SCFG consists of a set of variables, some terminal and some non-terminal. Terminals correspond to columns of the input alignment and non-terminals correspond to the states of the grammar. The non-terminals are rewritten according to a set of production rules. Each production rule specifies a single non-terminal and a set of columns (if any) that it should be rewritten to. Successively applying the production rules of the grammar until all columns of the input alignment have been described is called parsing, which defines a parse tree. Most literature on SCFGs assumes the grammar to be in Chomsky normal form [15] for the algorithms to be used. However, the profile SCFG presented here decomposes productions into two independent parts: non-terminal transitions and terminal emissions. The profile SCFG can thus be defined by the four-tuple $M = \{W, T, Al, E\}$. Here, $M$ is the profile SCFG; $W$ is the set of non-terminals; $T$ is the matrix of transition distributions; $Al$ is the set of terminals; and $E$ is the matrix of emission distributions. We define them as follows.

$W = \{\mathrm{start}, \mathrm{bifurcation}, \mathrm{single}, \mathrm{pair}, \mathrm{end}\}$.
$T = [t(w, w')]$, where $w, w' \in W$, and $t(w, w')$ is the transition probability from $w$ to $w'$.

$Al = \{A, C, G, U, -\}^n$, where $n$ is the number of sequences in the alignment and $-$ symbolizes the gap.

$E = [e_w]$, where $w \in W$. If $w = \mathrm{pair}$, then $e_w = e(\beta, \beta')$. Here, $\beta, \beta' \in Al$, and $(\beta, \beta')$ is a column-pair, which means two paired columns (see Fig.1(a)). If $w = \mathrm{single}$, then $e_w = e(\gamma)$. Here, $\gamma \in Al$, and $(\gamma)$ is a single column which does not pair with any column (see Fig.1(a)). The non-terminals start, bifurcation and end do not produce any column.

The production rules can be categorized into three classes: the pair rules, the single rules and the others. We define them as follows.

The pair rules:
$\mathrm{pair} \to \beta\, w'\, \beta'$ (probability: $t(\mathrm{pair}, w')\, e(\beta, \beta')$).

The single rules:
$\mathrm{single} \to \gamma\, w'$ (probability: $t(\mathrm{single}, w')\, e(\gamma)$),
$\mathrm{single} \to w'\, \gamma$ (probability: $t(\mathrm{single}, w')\, e(\gamma)$).

The others:
$\mathrm{start} \to w$ (probability: $t(\mathrm{start}, w)$, $w \in \{\mathrm{pair}, \mathrm{single}\}$),
$\mathrm{bifurcation} \to \mathrm{start}\ \mathrm{start}$ (probability: 1).

Fig.1. Modeling RNA secondary structure using the defined profile SCFG. (a) RNA alignment of three sequences. (b) Consensus structure for this alignment. For simplicity, only the first sequence is shown. (c) Parsing by the profile SCFG for this structure. (d) Parse tree for this parsing.

Here, start starts a parsing, end terminates a parsing, and bifurcation is used for splitting into multiple stems and multi-branch loops. The non-terminal single produces a single column in non-structural regions of the alignment. The non-terminal pair produces a column-pair in structural regions of the alignment. The consensus secondary structure of an RNA alignment can be modeled by the profile SCFG defined above. As shown in Fig.1, Fig.1(a) is a three-way alignment; Fig.1(b) is the consensus secondary structure

which contains two stems as shown in Fig.1(a); Fig.1(c) is the parsing of the structure shown in Fig.1(b); and Fig.1(d) is the parse tree for the parsing. Note that an RNA alignment may have more than one consensus structure, and each of them corresponds to one parsing by the profile SCFG. For simplicity, only one of the possible structures is presented here. The SCFG presented here differs from other SCFGs in one major respect: when a production rule is applied, it generates a single column or column-pair rather than a single base or base-pair. By assigning a probability to each production rule, each parse tree can be assigned an overall probability, which is the product of the probabilities of all production rules used in the tree. Therefore, an RNA secondary structure can be evaluated by the parse tree given by the SCFG. The probability of an RNA secondary structure is computed by:

$$P(\sigma \mid M) = P(\mathit{Tree} \mid M) = \prod_{i=1}^{l_1} t_i(\mathrm{pair}, w)\, e(\beta, \beta') \prod_{j=1}^{l_2} t_j(\mathrm{single}, w')\, e(\gamma) \prod_{k=1}^{l_3} t_k, \quad w, w' \in W. \tag{1}$$

Here, $\mathit{Tree}$ is the parse tree that uses $l_1$ pair rules, $l_2$ single rules and $l_3$ other rules; $t_k$ is the probability of the $k$-th rule among the others; $\sigma$ is the secondary structure which can be parsed by $\mathit{Tree}$ given the profile SCFG $M$.

2.2 Phylogenic Analysis of Homologous RNA Sequences Using HMMs

Evolutionary substitution models have long been used for phylogenic inference and for the study of molecular evolution. Here, we first present two novel evolutionary models using the theory of HMMs, and then use them to perform phylogenic analysis. Our method takes a phylogenic tree relating the sequences (as shown in Fig.2(b)) as input, and assumes that all the sequences in the alignment evolve independently. Suppose $D$ denotes an RNA alignment containing $n$ sequences of length $N$; $\alpha_i$ denotes the $i$-th column of $D$; $\alpha^j$ denotes the $j$-th row (i.e., the $j$-th sequence) of $D$; $\alpha_i^j$ denotes the base at site $(i, j)$ of $D$. Here, $1 \le i \le N$ and $1 \le j \le n$.
Suppose $Tr$ is the phylogenic tree inferred from the $n$ sequences and $R$ is the set of edges of the tree. Then $Tr$ is a tree with $n$ leaves, with sequence $j$ at leaf $j$. As shown in Fig.2, Fig.2(a) is an RNA alignment of three sequences, and Fig.2(b) is one of the possible phylogenic trees inferred from the sequences. Suppose the nodes of $Tr$ are numbered from 1 to $2n-1$ and the last is the root node. Then the probability of $\alpha_i$ given the tree $Tr$, $P(\alpha_i \mid Tr, R)$, can be evaluated by:

$$P(\alpha_i \mid Tr, R) = P(\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^n \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} q_{2n-1} \prod_{k=1}^{2n-2} P(a_k \mid a_{F(k)}, r_k). \tag{2}$$

Fig.2. (a) RNA alignment of three sequences. (b) Phylogenic tree with three leaves, with sequence $j$ at leaf $j$ ($1 \le j \le 3$). For simplicity, only the evolutionary substitution process for one of the single columns (labeled by the box in (a)) is shown.

Here, $R = \{r_1, r_2, \ldots, r_{2n-2}\}$; $q_{2n-1}$ is the initial probability of the base at the root node; $F(k)$ is the immediate ancestor node of $k$ ($1 \le k \le 2n-2$); $a_k$ ($1 \le k \le 2n-1$) denotes the base at node $k$; $P(a_k \mid a_{F(k)}, r_k)$ denotes the probability of $a_k$ arising from $a_{F(k)}$ over $r_k$ ($1 \le k \le 2n-2$). In particular, for a leaf node $k = j$ ($1 \le k \le n$), $a_k = \alpha_i^j$. The sum is over all possible assignments of bases $a_k$ to non-leaf nodes $k$ ($n+1 \le k \le 2n-1$). However, (2) is only suitable for ungapped alignments because it does not take deletions and insertions into account. To solve this problem, we model the evolutionary substitution process using two distinct HMMs.

We define the evolutionary model for single columns with the five-tuple $\lambda' = \{S, V, \pi, T', E'\}$. Here, $\lambda'$ is the HMM; $S$ is the set of states; $V$ is the set of observed characters; $\pi$ is the initial probability vector of the states; $T'$ is the matrix of transition probabilities; $E'$ is the matrix of emission probabilities. We define them as follows.

$S = \{\mathrm{match}, \mathrm{delete}, \mathrm{insert}\}$, where match symbolizes the matching state, delete symbolizes the deletion state and insert symbolizes the insertion state.

$V = \{A, C, G, U, -\}$.

$\pi = \{\pi_{\mathrm{match}}, \pi_{\mathrm{delete}}, \pi_{\mathrm{insert}}\}$, where $\pi_x$ is the initial probability of $x$ and $x \in S$.

$T' = [t'(x, y)]$, where $x, y \in S$, and $t'(x, y)$ is the transition probability from $x$ to $y$.

$E'$ is the matrix of emission probabilities. Here, however, we substitute the mutation probabilities of single bases for the emission probabilities. Let $e'_{x,y}(b, b')$ denote the probability of $b$ evolving to $b'$ when $x$ changes to $y$. Here $x, y \in S$, and $b, b' \in V$. The states match, delete and insert cover the following three cases:
$b \ne -$ and $b' \ne -$ implies $x = \mathrm{match}$.
$b \ne -$ and $b' = -$ implies $x = \mathrm{delete}$.
$b = -$ and $b' \ne -$ implies $x = \mathrm{insert}$.

For the leaf node $j$ ($1 \le j \le n$), let $A_j$ denote the set of ancestor nodes of $j$ and $R_j$ denote the set of corresponding edges. Suppose the node $j$ has $m(j)$ ancestor nodes. Thus $A_j = \{2n-1, A(1), A(2), \ldots, A(m(j)-1)\}$. Here, $2n-1$ is the root node and $A(k)$ is the immediate ancestor node of $A(k+1)$ for $1 \le k \le m(j)-2$. For the leaf node $j$ ($1 \le j \le n$), we set $f$ as the recursion variable and initialize it with:

$$f_1(y) = q_{2n-1}\, \pi_{\mathrm{match}}\, t'(\mathrm{match}, y)\, e'_{\mathrm{match},y}(a_{2n-1}, a_{A(1)}), \quad y \in S. \tag{3}$$

We define $\pi_{\mathrm{match}} = 1$ because the root node is always in the matching state. The recurrence is:

$$f_k(y) = \sum_{x \in S} f_{k-1}(x)\, t'(x, y)\, e'_{x,y}(a_{A(k-1)}, a_{A(k)}), \quad x, y \in S;\ 2 \le k \le m(j)-1. \tag{4}$$

The probability of $\alpha_i^j$ given $Tr$ and $R_j$ is thus computed by:

$$P(\alpha_i^j \mid Tr, R_j) = \sum_{y \in S} \sum_{x \in S} f_{m(j)-1}(x)\, t'(x, y)\, e'_{x,y}(a_{A(m(j)-1)}, \alpha_i^j), \quad 1 \le i \le N;\ 1 \le j \le n. \tag{5}$$

In the case of single columns (i.e., non-structural regions), (2) is thus revised to:

$$P(\alpha_i \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} \prod_{j=1}^{n} P(\alpha_i^j \mid Tr, R_j), \quad 1 \le i \le N. \tag{6}$$

Similarly, we define the evolutionary model for column-pairs with another five-tuple $\lambda'' = \{S, V, \pi, T'', E''\}$. We define them as follows.

$S = \{\mathrm{match}, \mathrm{delete}, \mathrm{insert}\}$.

$V = \{A, C, G, U, -\} \times \{A, C, G, U, -\}$.

$\pi = \{\pi_{\mathrm{match}}, \pi_{\mathrm{delete}}, \pi_{\mathrm{insert}}\}$, where $\pi_x$ is the initial probability of $x$ and $x \in S$.

$T'' = [t''(x, y)]$, where $x, y \in S$ and $t''(x, y)$ is the transition probability from $x$ to $y$.

$E''$ is the matrix of emission probabilities.
Here, however, we substitute the mutation probabilities of base-pairs for the emission probabilities. Let $e''_{x,y}(b_1 b_1', b_2 b_2')$ denote the probability of $b_1 b_1'$ evolving to $b_2 b_2'$ when $x$ changes to $y$. Here $x, y \in S$ and $b_1 b_1', b_2 b_2' \in V$. The states match, delete and insert cover the following three cases:
If none of the characters in $b_1 b_1'$ changes to the gap character $-$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{match}$.
If at least one of the characters in $b_1 b_1'$ changes to the gap character $-$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{delete}$.
If at least one gap in $b_1 b_1'$ changes to one of $\{A, C, G, U\}$ when $b_1 b_1'$ evolves to $b_2 b_2'$, then $x = \mathrm{insert}$.

Let $\alpha_i \alpha_{i'}$ denote one of the column-pairs of $D$ and $\alpha_i^j \alpha_{i'}^j$ denote the base-pair from $\alpha_i \alpha_{i'}$ in the $j$-th sequence (i.e., the leaf node $j$) of $D$. Here, $1 \le j \le n$ and $1 \le i, i' \le N$. Also, let $a_k a_k'$ ($1 \le k \le 2n-1$) denote the base-pair at the node $k$. For the leaf node $j$ ($1 \le j \le n$), we set $f$ as the recursion variable and initialize it with:

$$f_1(y) = q_{2n-1}\, \pi_{\mathrm{match}}\, t''(\mathrm{match}, y)\, e''_{\mathrm{match},y}(a_{2n-1} a_{2n-1}', a_{A(1)} a_{A(1)}'), \quad y \in S. \tag{7}$$

Here, $q_{2n-1}$ is the initial probability of the base-pair at the root node, and again $\pi_{\mathrm{match}} = 1$. The recurrence is:

$$f_k(y) = \sum_{x \in S} f_{k-1}(x)\, t''(x, y)\, e''_{x,y}(a_{A(k-1)} a_{A(k-1)}', a_{A(k)} a_{A(k)}'), \quad x, y \in S;\ 2 \le k \le m(j)-1. \tag{8}$$

The probability of $\alpha_i^j \alpha_{i'}^j$ given $Tr$ and $R_j$ is thus computed by:

$$P(\alpha_i^j \alpha_{i'}^j \mid Tr, R_j) = \sum_{y \in S} \sum_{x \in S} f_{m(j)-1}(x)\, t''(x, y)\, e''_{x,y}(a_{A(m(j)-1)} a_{A(m(j)-1)}', \alpha_i^j \alpha_{i'}^j), \quad 1 \le i, i' \le N;\ 1 \le j \le n. \tag{9}$$

In the case of column-pairs (i.e., structural regions), (2) is thus revised to:

$$P(\alpha_i \alpha_{i'} \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} \prod_{j=1}^{n} P(\alpha_i^j \alpha_{i'}^j \mid Tr, R_j), \quad 1 \le i, i' \level N. \tag{10}$$
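The recursions (3)-(5) and (7)-(9) share one structure: a forward-style pass along the path from the root to a leaf. The following is a minimal Python sketch of that pass, not the authors' implementation. The names `path_probability`, `toy_trans` and `toy_emit` are hypothetical, the toy parameter values are invented for illustration, and the handling of a leaf attached directly to the root (a path with a single edge) is an assumption, since the equations above do not spell out that case.

```python
import math

STATES = ("match", "delete", "insert")


def path_probability(path, q_root, trans, emit, pi_match=1.0):
    """Probability of the leaf character at the end of a root-to-leaf path.

    path   : characters from the root down to the leaf, e.g. ["A", "A", "G"];
             for column-pairs the entries can be pairs such as "AU"
    q_root : initial probability of the character at the root
    trans  : dict mapping (x, y) to a state transition probability
    emit   : function (x, y, b, b2) -> mutation probability of b evolving
             to b2 when state x changes to state y
    """
    if len(path) < 2:
        raise ValueError("path needs a root and at least one descendant")

    # Initialization as in (3)/(7): the root is always in the match state.
    f = {
        y: q_root * pi_match * trans[("match", y)]
        * emit("match", y, path[0], path[1])
        for y in STATES
    }

    # Recurrence as in (4)/(8) for every remaining edge of the path.
    # The final edge (last ancestor -> leaf) is the inner sum of (5)/(9).
    for parent, child in zip(path[1:], path[2:]):
        f = {
            y: sum(f[x] * trans[(x, y)] * emit(x, y, parent, child)
                   for x in STATES)
            for y in STATES
        }

    # Outer sum over y in (5)/(9).
    return sum(f.values())


# Toy parameters (assumptions for illustration only): uniform transitions
# and a constant mutation probability of 0.2 for every character change.
toy_trans = {(x, y): 1.0 / 3.0 for x in STATES for y in STATES}


def toy_emit(x, y, b, b2):
    return 0.2


if __name__ == "__main__":
    # Root "A", one internal ancestor "A", leaf "G": two edges.
    print(path_probability(["A", "A", "G"], 0.25, toy_trans, toy_emit))
```

With these toy parameters every edge contributes a factor of 0.2 after summing over states, so a two-edge path with root probability 0.25 yields roughly 0.01. Passing the mutation model as a function keeps the same code usable for both the single-column and the column-pair model.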

2.3 Secondary Structure Prediction Using Combined Model

We construct a combined model by revising $M$ using $\lambda'$ and $\lambda''$. In more detail, we revise (1) to:

$$P(\sigma \mid M, \lambda', \lambda'') = P(\mathit{Tree} \mid M, \lambda', \lambda'') = \prod_{i=1}^{l_1} t_i(\mathrm{pair}, w)\, e(\beta, \beta')\, P(\beta\beta' \mid Tr, R) \prod_{j=1}^{l_2} t_j(\mathrm{single}, w')\, e(\gamma)\, P(\gamma \mid Tr, R) \prod_{k=1}^{l_3} t_k, \quad w, w' \in W. \tag{11}$$

Here, $P(\beta\beta' \mid Tr, R)$ is computed by (10) and $P(\gamma \mid Tr, R)$ is computed by (6). Let $\sigma_{\max}$ denote the optimal secondary structure for the alignment $D$; then:

$$\sigma_{\max} = \operatorname*{argmax}_{\sigma} P(\sigma \mid M, \lambda', \lambda''). \tag{12}$$

$\sigma_{\max}$ is found using the CYK algorithm [15] on the profile SCFG $M$.

3 Results

3.1 Parameters Estimation

The free parameters of the combined model are those of $M$, $\lambda'$ and $\lambda''$. To estimate the parameters, we construct the training data sets using the known consensus secondary structures of ncRNA families taken from the Rfam database. Here, we build the training data set together with the testing data set and report them in Table 1.

Table 1. Distribution of the Training and Testing Data Sets

Id (%)        < 50   50-60   60-70   70-80   80-90   90-100
# Family        10      15      20      45      40       20
# Sequence     100     120     200     400     300      180
# Three-way     30      36      60     100      90       50
# Four-way      20      24      40      60      50       30
# Five-way      20      24      40      60      50       30

Note: Id: percentage identity; # Family: number of RNA families; # Sequence: number of RNA sequences; # Three-way: number of three-way alignments; # Four-way: number of four-way alignments; # Five-way: number of five-way alignments.

Specifically, Rfam contains seed alignments of multiple RNA sequences, and a consensus secondary structure for each alignment that is either taken from a previously published study in the literature or predicted using automated covariance-based methods. To establish gold-standard data for training and testing, we first removed all seed alignments with only predicted secondary structures, retaining the 150 families with secondary structures from the literature.
The end result was a set of 150 independent examples, each taken from a different RNA family. For each selected seed alignment, we randomly extract three, four and five different sequences from the ncRNA family. For example, from a family of 5 sequences we can obtain in total 10 (i.e., $C_5^3$) three-way alignments, 5 (i.e., $C_5^4$) four-way alignments, and 1 (i.e., $C_5^5$) five-way alignment. In this way, we construct one set of three-way alignments, one set of four-way alignments and one set of five-way alignments. These data sets are used for training the models $M$, $\lambda'$ and $\lambda''$. The sequence identity of the alignments ranges from 40% to 99%. The distribution of the training and testing data sets is shown in Table 1. Of the data sets shown in Table 1, the alignments from 100 families are selected to build the training data set and the alignments from the other 50 families are used to build the testing data set.

As for the parameters of $M$, the set of transition distributions $T$ and the set of emission distributions $E$ are estimated. For estimating the probabilities of $T$, the inside-outside algorithm [15] (an expectation maximization procedure) is performed on the training set of secondary structures. In more detail, the number of times each rule is used is counted from the training set, and the probability of each rule is then computed by dividing the number of times it is used by the total number of times all rules are used. For estimating the $e(\gamma)$ probabilities of $E$, the single-column frequencies are estimated from counts of the columns in the non-structural regions of the alignments of the training set. Thus, the overall single-column frequencies are determined, and hence so is the probability of each kind of single column. The $e(\beta, \beta')$ probabilities of $E$ are estimated in the same way as $e(\gamma)$. The only difference is that column-pairs rather than single columns are counted, and they are taken from the structural regions of the alignments.
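The two counting steps described above can be sketched in a few lines of Python. This is an illustrative sketch only: `relative_frequencies` and `sub_alignments` are hypothetical helper names, the example columns are invented, and the sketch covers only the relative-frequency estimate, not the inside-outside procedure.

```python
from collections import Counter
from itertools import combinations


def relative_frequencies(observations):
    """Estimate probabilities as count / total count.

    `observations` can be single columns (tuples of characters, one per
    sequence), column-pairs, or the names of applied production rules.
    """
    counts = Counter(observations)
    total = sum(counts.values())
    return {item: count / total for item, count in counts.items()}


def sub_alignments(sequence_ids, k):
    """Enumerate every way of choosing k sequences from one family."""
    return list(combinations(sequence_ids, k))


if __name__ == "__main__":
    # Four invented single columns from a three-way alignment.
    columns = [("A", "A", "G"), ("A", "A", "G"), ("C", "-", "C"), ("U", "U", "U")]
    print(relative_frequencies(columns))

    # A family of 5 sequences gives 10 three-way, 5 four-way
    # and 1 five-way sub-alignment.
    family = ["s1", "s2", "s3", "s4", "s5"]
    print([len(sub_alignments(family, k)) for k in (3, 4, 5)])
```

Treating each column as a tuple lets the same function serve for $e(\gamma)$, for $e(\beta, \beta')$ (with pairs of tuples as observations), and for rule usage counts.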
As for the parameters of $\lambda'$, the set of transition distributions $T'$ and the set of emission distributions $E'$ are estimated. For estimating the probabilities of $T'$, the forward-backward algorithm [15] is performed on the evolutionary process of each sequence in the alignments of the training set. For the probabilities of $E'$, recall that we substitute the mutation probabilities of bases for the emission probabilities. For estimating mutation rates, a number of sequences from the training set are paired. The single-base positions in these sequence pairs are examined and all differences between the sequences are counted. For example, if a given position has base $b$ in one sequence and base $b'$ in the other, both the counter for $b$ changing to $b'$ and the counter for $b'$ changing to $b$ are incremented.

As for the parameters of $\lambda''$, the set of transition distributions $T''$ is estimated in the same way as $T'$, and the set of emission distributions $E''$ is estimated in the same way as $E'$. The only difference is that base-pairs rather than single bases are examined and counted in the sequence pairs mentioned above.

3.2 Tests for Multiple Alignments of RNA Sequences

As mentioned in Subsection 3.1, the testing data set is also constructed from the Rfam database. We first choose the 50 ncRNA families from Table 1 that are not included in the training set, and then build sets of alignments as we do for the training set. The sequence identity of the alignments again ranges from 40% to 99%. The testing data sets are again composed of one set of three-way alignments, one set of four-way alignments, and one set of five-way alignments. These alignments are also used as the input for Pfold. To test our method and compare it with Pfold, we evaluate the predictions of each method using the sensitivity and specificity of the predicted base-pairs. We compute the sensitivity as the number of true positives divided by the sum of true positives and false negatives, and the specificity as the number of true positives divided by the sum of true positives and false positives.

Table 2. Sensitivity and Specificity on Data Set of Three-Way Alignments

Id (%)    Se (%)   Sp (%)   Se Pfold (%)   Sp Pfold (%)
< 50      76.97    75.01    73.68          73.61
50-60     78.99    76.83    76.13          75.88
60-70     73.46    75.11    72.31          73.98
70-80     72.05    73.18    71.62          72.56
80-90     67.15    71.84    66.59          71.83
90-100    73.55    71.25    73.52          71.06
Total     73.70    73.87    72.31          73.15

Note: Id: sequence identity; Se: sensitivity; Sp: specificity.
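The two accuracy measures defined in this subsection can be written down directly. The sketch below is an assumption about representation, not part of the original evaluation code: a structure is taken to be a set of base-pairs `(i, j)` of alignment positions, and `pair_accuracy` is a hypothetical function name. Returning 0.0 when a denominator is zero is also a choice made here for illustration.

```python
def pair_accuracy(predicted, reference):
    """Sensitivity and specificity of predicted base-pairs.

    sensitivity = TP / (TP + FN)
    specificity = TP / (TP + FP), following the definition used in the text
    """
    predicted, reference = set(predicted), set(reference)
    true_pos = len(predicted & reference)    # pairs found in both structures
    false_pos = len(predicted - reference)   # predicted but not in reference
    false_neg = len(reference - predicted)   # in reference but missed

    sensitivity = true_pos / (true_pos + false_neg) if reference else 0.0
    specificity = true_pos / (true_pos + false_pos) if predicted else 0.0
    return sensitivity, specificity


if __name__ == "__main__":
    reference = {(1, 20), (2, 19), (3, 18), (4, 17)}
    predicted = {(1, 20), (2, 19), (5, 16)}
    print(pair_accuracy(predicted, reference))
```

In this example two of the four reference pairs are recovered and one of the three predicted pairs is wrong, so the sensitivity is 0.5 and the specificity is 2/3.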
The results of our method are compared with those of Pfold, and are reported in Tables 2-4. In these tables, the second and third columns are the results of our method, followed by the results of Pfold. As shown in the tables, our method exhibits both higher sensitivity and higher specificity than Pfold, especially for alignments with lower sequence identity. One significant finding is that our method exhibits much higher accuracy than Pfold when many gaps are introduced into the input alignment. Another valuable finding is that our method exhibits higher accuracy when more sequences are included. The best accuracy of our method (sensitivity 83.55%, specificity 81.68%) is achieved in tests on five-way alignments with 50-60% sequence identity.

Table 3. Sensitivity and Specificity on Data Set of Four-Way Alignments

Id (%)    Se (%)   Sp (%)   Se Pfold (%)   Sp Pfold (%)
< 50      78.12    77.22    73.78          73.46
50-60     81.15    79.28    77.63          76.21
60-70     75.96    76.65    73.76          74.31
70-80     72.29    73.79    71.97          72.99
80-90     68.11    72.94    67.98          72.58
90-100    73.95    71.88    73.81          71.26
Total     74.93    75.29    73.16          73.47

Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

Table 4. Sensitivity and Specificity on Data Set of Five-Way Alignments

Id (%)    Se (%)   Sp (%)   Se Pfold (%)   Sp Pfold (%)
< 50      79.96    78.67    74.11          73.46
50-60     83.55    81.68    77.97          76.21
60-70     77.44    78.91    74.16          74.31
70-80     74.12    73.98    72.31          72.99
80-90     69.98    74.36    67.99          72.68
90-100    74.19    73.22    73.98          72.15
Total     76.54    76.80    73.42          73.63

Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

3.3 Comparison with the Model Without delete and insert

Here, we compare the model presented in Subsection 2.2 with a version of the model in which the delete and insert states are disabled. For the model without delete and insert, we still use (2) to compute the probability of one single column from non-structural regions of the alignment.
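We use the following equation to compute the probability of one column-pair from structural regions of the alignment:

$$P(\alpha_i \alpha_{i'} \mid Tr, R) = P(\alpha_i^1 \alpha_{i'}^1, \alpha_i^2 \alpha_{i'}^2, \ldots, \alpha_i^n \alpha_{i'}^n \mid Tr, R) = \sum_{a_{n+1}, a_{n+2}, \ldots, a_{2n-1}} q_{2n-1} \prod_{k=1}^{2n-2} P(a_k a_k' \mid a_{F(k)} a_{F(k)}', r_k). \tag{13}$$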

Here, $\alpha_i \alpha_{i'}$ still denotes one of the column-pairs of $D$, and $a_k a_k'$ ($1 \le k \le 2n-1$) denotes the base-pair at the node $k$. $P(a_k a_k' \mid a_{F(k)} a_{F(k)}', r_k)$ denotes the probability of $a_k a_k'$ arising from $a_{F(k)} a_{F(k)}'$ over $r_k$ ($1 \le k \le 2n-2$). We still use (11) and (12) to compute the optimal secondary structure for the alignment $D$, but $P(\beta\beta' \mid Tr, R)$ in (11) is computed by (13) and $P(\gamma \mid Tr, R)$ is computed by (2). We selected some five-way alignments containing more gaps from the testing data sets built in Subsection 3.2 and tested the model without delete and insert on them. The results are compared with those of the full model described in Subsection 2.3, and are reported in Table 5. The second and third columns in the table are the results for the full model, followed by the results for the model without delete and insert.

Table 5. Testing for the Full Model and the Model Without delete and insert on Five-Way Alignments

Id (%)    Se Full (%)   Sp Full (%)   Se (%)   Sp (%)
< 70      76.21         75.99         71.23    73.99
70-90     66.29         71.18         64.18    70.31
90-100    64.19         67.96         64.12    67.10
Total     68.90         71.71         66.51    70.47

Note: Id: sequence identity; Se: sensitivity; Sp: specificity.

4 Discussion

In this paper, we devise, implement and evaluate a new SCFG-based method for predicting RNA secondary structure by introducing a more detailed phylogenic analysis of homologous sequences. Our method improves the prediction by modeling RNA secondary structures using the newly defined profile SCFG and by performing phylogenic analysis of RNA sequences using more evolutionary information, including deletion and insertion. The new profile SCFG presented here sets our method apart from other methods. The phylogenic analysis of RNA sequences using two distinct HMMs makes our method robust. The fact that our method takes a gapped RNA alignment as input makes it well suited to practical application.
Despite the limited amount of data, our experiments show that our method predicts RNA secondary structure more accurately than Pfold, especially for alignments with more sequences, more gaps, and lower sequence identity but higher structure identity. This is the expected behavior, because our method takes much more information about the evolutionary process into account, whereas Pfold uses a simplified evolutionary model to analyze the sequences in the alignment. On the other hand, both methods exhibit similar accuracy for alignments with very high sequence identity (> 90%). Little evolutionary information can be obtained from such alignments, so our method loses some of its advantage over Pfold. In conclusion, our method can predict RNA secondary structure with better performance than Pfold.

As for the comparison of the full model with the model without delete and insert (see Table 5), we conclude that the states delete and insert help improve the prediction of RNA secondary structure when the alignment contains more sequences and more gaps. More benchmarks should be run to reach a more precise conclusion.

There are some differences between the method presented here and Pfold. First, two distinct HMMs, λ and λ′, are proposed for performing phylogenic analysis of single columns and column pairs of the RNA sequence alignment. Both λ and λ′ include the states match, delete and insert. Pfold, viewed as an HMM, has only the state match, because it treats the gap as a fifth character outside {A, C, G, U}. Hence, much evolutionary information, such as deletion and insertion, is omitted by Pfold. Second, the HMMs presented here model the evolutionary substitution processes of single columns and of column pairs separately, whereas Pfold does not take the difference between single columns and column pairs into account.
Third, we define and train a new profile SCFG to model RNA secondary structure. In particular, we introduce the concepts of single column and column pairs. Finally, we merge the new SCFG and the two HMMs into a combined model to predict RNA secondary structure. In this sense, the method presented here can be regarded as an improvement of Pfold.

There are several ways in which our method could be improved. First, because our method takes a fixed alignment as input, it is vulnerable to alignment errors. Unfortunately, it is difficult to construct an accurate alignment of RNA sequences without knowing their secondary structures. One way to address this problem is to use structural information when aligning RNA sequences. We have presented a fast and practical method for detecting and assessing secondary structure in RNA alignments in [18]. In future work, we will apply it to constructing high-quality alignments, which can in turn be used to improve the prediction of RNA secondary structure. Second, our method takes a phylogenic tree relating the sequences as input. If the tree is not given, it must be estimated from the model. Estimating tree topology can be done by an

exhaustive search, a branch and bound method or a heuristic method. The choice depends strongly on the number of sequences in the alignment, because the number of possible trees grows very quickly with the number of sequences. Third, the time complexity of the method may be a concern, especially for modeling the secondary structure using the new profile SCFG. To address this, new parallel algorithms for modeling and predicting RNA secondary structures could be devised, but further study is needed to ensure that the accuracy of the method is maintained. Finally, we hope that our method will find use in studies on both biology and bioinformatics.

References

[1] Storz G. An expanding universe of noncoding RNAs. Science, 2002, 296(5571): 1260-1263.
[2] Eddy S R. Non-coding RNA genes and modern RNA world. Nat. Rev. Genet., 2001, 2(12): 919-929.
[3] Huttenhofer A, Schattner P, Polacek N. Non-coding RNAs: Hope or hype? TRENDS in Genetics, 2005, 21(5): 289-297.
[4] Furtig B et al. NMR spectroscopy of RNA. Chembiochem, 2003, 4(10): 936-962.
[5] Gardner P P, Giegerich G. A comprehensive comparison of comparative RNA structure prediction approaches. BMC Bioinformatics, 2004, 5: 140-157.
[6] Zuker M, Stiegler P. Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucleic Acids Research, 1981, 9(1): 133-148.
[7] Hofacker I, Fekete M, Stadler P. Secondary structure prediction for aligned RNA sequences. Journal of Molecular Biology, 2002, 319(5): 1059-1066.
[8] Sakakibara Y et al. Stochastic context-free grammars for tRNA modeling. Nucleic Acids Research, 1994, 22(23): 5112-5120.
[9] Eddy S R, Durbin R. RNA sequence analysis using covariance models. Nucleic Acids Research, 1994, 22(11): 2079-2088.
[10] Dowell R, Eddy S. Evaluation of several lightweight stochastic context-free grammars for RNA secondary structure prediction.
BMC Bioinformatics, 2004, 5: 71-84.
[11] Dowell R, Eddy S. Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints. BMC Bioinformatics, 2006, 7: 400-417.
[12] Knudsen B, Hein J. RNA secondary structure prediction using stochastic context-free grammars and evolutionary history. Bioinformatics, 1999, 15(6): 446-454.
[13] Knudsen B, Hein J. Pfold: RNA secondary structure prediction using stochastic context-free grammars. Nucleic Acids Research, 2003, 31(13): 3423-3428.
[14] Do C B, Woods D A, Batzoglou S. CONTRAfold: RNA secondary structure prediction without physics-based models. Bioinformatics, 2006, 22(14): e90-e98.
[15] Durbin R et al. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids. Cambridge: Cambridge University Press, 1998, pp.233-297.
[16] Pace N R, Thomas B C, Woese C R. Probing RNA Structure, Function, and History by Comparative Analysis. The RNA World, 2nd edition, NY: Cold Spring Harbor Laboratory Press, 1999, pp.113-141.
[17] Sam G J, Alex B et al. Rfam: An RNA family database. Nucleic Acids Research, 2003, 31(1): 439-441.
[18] Xiaoyong Fang et al. The detection and assessment of possible RNA secondary structure using multiple sequence alignment. In Proc. the 22nd Annual ACM Symposium on Applied Computing, Seoul, Korea, March 11-15, 2007, pp.133-137.

Xiao-Yong Fang is currently a Ph.D. candidate in the School of Computer Science, National University of Defense Technology, China. His research interests include bioinformatics, computational biology, computational intelligence and high-performance computing. He has published several papers in international conferences such as the 22nd Annual ACM Symposium on Applied Computing and several papers in international journals such as Bioinformation.

Zhi-Gang Luo is currently a professor in the School of Computer Science, National University of Defense Technology, China.
His research interests include bioinformatics, parallel algorithms, artificial intelligence, machine learning and high-performance computing in application domains of biomedical informatics. He has published over 40 journal and conference papers and has served as a reviewer for more than five international conferences. He was a program committee member of the Fifth International Workshop on Advanced Parallel Processing Technologies, Xiamen, China, 2005, and the Fifth International Conference on Grid and Cooperative Computing, Hunan, China, 2006.

Zheng-Hua Wang is a professor of computer science at the National University of Defense Technology, China. His current research interests are computer system performance evaluation, parallel processing, and bioinformatics. He received the B.Eng., M.S. and Ph.D. degrees in mechanics from the National University of Defense Technology in 1983, 1988, and 1992, respectively.