The 4M (Mixed Memory Markov Model) Algorithm for Finding Genes in Prokaryotic Genomes

Mathukumalli Vidyasagar, Fellow, IEEE, Sharmila S. Mande, Ch. V. Siva Kumar Reddy, and V. V. Raja Rao

Abstract—In this paper, we present a new algorithm called 4M (mixed memory Markov model) for finding genes from the genomes of prokaryotes. This is achieved by modeling the known coding regions of the genome as a set of sample paths of a multistep Markov chain (the coding model) and the known non-coding regions as a set of sample paths of another multistep Markov chain (the non-coding model). The new feature of the 4M algorithm is that different states are allowed to have different memory lengths, in contrast to a fixed multistep Markov model used in GeneMark in its various versions. At the same time, compared with an algorithm like Glimmer3 that uses an interpolation of Markov models of different memory lengths, the statistical significance of the conclusions drawn from the 4M algorithm is quite easy to quantify. Thus, when a whole genome annotation is carried out and several new genes are predicted, it is extremely easy to rank these predictions in terms of the confidence one has in the predictions. The basis of the 4M algorithm is a simple rank condition satisfied by the matrix of frequencies associated with a Markov chain. The 4M algorithm is validated by applying it to 75 organisms belonging to practically all known families of bacteria and archaea. The performance of the 4M algorithm is compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It is found that, in a vast majority of cases, the 4M algorithm finds many more genes than it misses, compared with any of the other three algorithms. Next, the 4M algorithm is used to carry out whole genome annotation of 13 organisms by using 50% of the known genes as the training input for the coding model and 20% of the known non-genes as the training input for the non-coding model. After this, all of the open reading frames are classified. It is found that the 4M algorithm is highly specific in that it picks out virtually all of the known genes, while predicting that only a small number of the open reading frames whose status is unknown are genes.

Index Terms—Algorithm, gene prediction, K-L divergence, Markov model, prokaryotes.

(Manuscript received January 22, 2007; revised September 1. The authors are with the Advanced Technology Centre, Tata Consultancy Services, Software Units Layout, Madhapur, Hyderabad, India; e-mail: sagar@atc.tcs.com; sharmila@atc.tcs.com; siva@atc.tcs.com; raja@atc.tcs.com.)

I. INTRODUCTION

A. Gene-Finding Problem

ALL LIVING things consist of DNA, which is a very complex molecule arranged in a double helix. DNA consists of a series of nucleotides, where each nucleotide is denoted by the base it contains, namely, A (Adenine), C (Cytosine), G (Guanine), or T (Thymine). The genome of an organism is the listing of one strand of DNA as an enormously long sequence of symbols from the four-symbol alphabet {A, C, G, T}. Certain parts of the genome correspond to genes that get converted into proteins, while the rest are non-coding regions. In prokaryotes, or lower organisms, the genes are in one continuous stretch, whereas in eukaryotes, or higher organisms, the genes consist of a series of exons, interrupted by introns. The junctions between exons and introns are called splice sites, and the detection of splice sites is a very difficult problem.
For this reason, the focus in this paper is on finding genes in prokaryotes. It is easy to state some necessary but not sufficient conditions for a stretch of genome to be a gene, which are given as follows. The sequence must begin with the start codon ATG (in some organisms, GTG is also a start codon). The sequence must end with one of the three stop codons, namely, TAA, TAG, or TGA. The length of the sequence must be an exact multiple of three. A stretch of genome that satisfies these conditions is referred to as an open reading frame (ORF).

B. Statistical Approaches to Gene-Finding

There are in essence two distinct approaches to gene-finding, namely, string-matching and statistical modeling. In string-matching algorithms, one looks for symbol-for-symbol matching, whereas in statistical modeling one looks for similarity of statistical behavior. If one were to examine genes with the same function across two organisms, then it is likely that the two DNA sequences would match quite well at a symbol-for-symbol level. For instance, if one were to compare the gene that generates insulin in a mouse and in a human, the two strings would be very similar, except that occasionally one would have to introduce a gap in one sequence or the other. This particular problem, namely, to determine the best possible match between two sequences after inserting a few gaps here and there, is known as the optimal gapped alignment problem. If the two strings to be aligned have lengths m and n, respectively, then an optimal alignment can be computed by dynamic programming, whose complexity is O(mn). Parallel implementations of this alignment algorithm are also possible. For further details, see [7] and [5].
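The optimal gapped alignment just mentioned is a classical dynamic programming computation. The following is a minimal sketch for illustration only (it is not the method used in this paper), with arbitrary placeholder values for the match, mismatch, and gap scores:

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global (gapped) alignment score by dynamic programming in O(mn) time.
    D[i][j] holds the best score for aligning the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap                          # a[:i] aligned entirely against gaps
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          D[i - 1][j] + gap,       # gap inserted in b
                          D[i][j - 1] + gap)       # gap inserted in a
    return D[m][n]

print(align_score("ATGGCGT", "ATGCGT"))            # two similar strings, one gap needed
```

Recovering the alignment itself, rather than just its score, requires an additional traceback over the same table.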

On the other hand, if one were to examine two different genes with distinct functionalities but from within the same organism or within the same family of organisms, then the two genes would not be similar at a symbol-for-symbol level. However, it is widely believed that they would match at a statistical level. The idea behind statistical prediction of genes can be summarized as follows. Suppose we are given several known strings of genes and several known strings of non-genes. We think of the known genes as sample paths of one stochastic process generated by a coding model and of the known non-genes as sample paths of another stochastic process generated by a non-coding model. Now suppose we are given an ORF, and we wish to classify it as being a gene or a non-gene. The logical approach is to use log-likelihood ratio classification. Thus, we compute the likelihood of the string according to the coding model and according to the non-coding model. If the likelihood under the coding model is larger, then we classify the ORF as a gene, whereas if the likelihood under the non-coding model is larger, then we classify it as a non-gene. In the unlikely event that both likelihoods are comparable, the method is inconclusive. The existing statistical gene prediction methods differ only in the manner in which the coding model and the non-coding model are constructed. In Genescan [18], the basic premise is that, in coding regions, the four nucleic acid symbols occur roughly with a period of three. Thus, there is no non-coding model as such. Instead, a discrete Fourier transform is taken of the occurrence of each nucleotide symbol, and the value is compared against a threshold. In Genscan [2], [3], the sample paths are viewed as outputs of a hidden Markov model that can also emit a blank symbol. Because of this, the length of the observed string does not match the length of the state sequence, thus leading to an extremely complicated dynamic programming problem. In GeneMark [12], the observed sequences are interpreted as the outputs of a multistep Markov process. Probably the most widely used classification method is Glimmer, which has several refinements to suit specific requirements. Some references to Glimmer can be found in [4] and [17]. We shall return to Glimmer and GeneMark again in Section V, when we present our computational results and compare them against those produced by Glimmer.

C. Contributions of This Paper

Glimmer uses an interpolated Markov model (IMM) whereby the sample paths are fit with multistep Markov models whose memory varies from 0 to 8. (Note that a Markov process with zero memory is an i.i.d. process.) This requires the estimation of a fairly large number of parameters. To overcome this difficulty, Glimmer uses high-order Markov models only when there is sufficient data to get a reliable estimate. Initially, GeneMark used a fifth-order Markov chain, but subsequent versions use refined versions, including a hidden Markov model (HMM). In contrast, the premise of our study is that, even in a multistep Markov process, different states have different effective memory. This leads to a mixed memory Markov model. Hence, the algorithm is called 4M. We begin by fitting the sample paths of both the coding regions and the non-coding regions with a fifth-order Markov model each. The reason for using a fifth-order model (thus exactly replicating hexamer frequencies) is that lower order models result in noticeably poorer performance, whereas higher order models do not seem to improve the performance very much. Using two fifth-order Markov models for the coding and non-coding regions results in two models having 1024 states each. Then, by using a simple singular value condition, many of these states are combined into one common state. In some cases, the resulting reduced-size Markov model has as few as 150 states, which is an 85% reduction!
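The log-likelihood ratio rule described above is straightforward to implement once the two models are available. The sketch below is illustrative only: it assumes each model is given as a dictionary mapping a (context, next symbol) pair to a transition probability, and the function names and the small probability floor are our own choices, not the authors' implementation.

```python
import math

def log_likelihood(seq, trans, k=5):
    """Log-likelihood of seq under a k-step Markov model supplied as a dict
    {(context, next_symbol): probability}; the first k symbols are ignored for brevity."""
    total = 0.0
    for i in range(k, len(seq)):
        total += math.log(trans.get((seq[i - k:i], seq[i]), 1e-12))  # floor avoids log(0)
    return total

def classify_orf(orf, coding_trans, noncoding_trans, k=5):
    """Gene if the coding model explains the ORF better, non-gene if the non-coding
    model does; comparable likelihoods are reported as inconclusive."""
    diff = log_likelihood(orf, coding_trans, k) - log_likelihood(orf, noncoding_trans, k)
    if abs(diff) < 1e-9:
        return "inconclusive"
    return "gene" if diff > 0 else "non-gene"
```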
In addition to the singular value test to choose the level of reduction permissible, we also use the Kullback-Leibler (K-L) divergence rate [9] to bound the error introduced by reducing the size of the state space. This upper bound on the K-L divergence rate can be used to choose a threshold parameter in the rank condition in an intelligent fashion. In addition, the K-L divergence rate is also used to demonstrate that the three-periodicity effect is very pronounced in coding regions but not in the non-coding regions. The statistical significance of the 4M algorithm is rather easy to analyze. As a result, when some ORFs are predicted to be genes using the 4M algorithm, our confidence in the prediction can be readily quantified, and the predictions can be ranked in order of decreasing confidence. In this way, the most confident predictions (if they are also interesting from a biological standpoint) can be followed up for experimental verification. All in all, the conclusion is that the 4M algorithm performs comparably well or somewhat better than Glimmer and GeneMark; however, in the case of the 4M algorithm, it is quite easy and straightforward to compute the statistical significance of the conclusions drawn.

II. 4M ALGORITHM

A. Multistep Markov Models

Recall that the statistical approach to gene-finding depends on being able to construct two distinct models, one for the coding regions and one for the non-coding regions. From purely a mathematical standpoint, we have only one problem at hand, namely, given a set of sample paths of a stationary stochastic process, construct a model for these paths. In other words, both models are constructed using exactly the same methodology, but applied to distinct sets of sample paths. Thus, let us concentrate on this problem formulation. Suppose n is a positive integer, and consider an alphabet of n symbols. (In the case of genomics, n = 4 and the alphabet is {A, C, G, T}.) Suppose we are given a stationary stochastic process assuming values in this alphabet, and we have at hand several sample paths of this process. The objective is to construct a stochastic model for the process on the basis of these observations. Suppose an integer k is specified, and we know the statistics of the process up to order k + 1. This means that the probabilities of occurrence of all (k + 1)-tuples are specified for the process. Let f(u) denote the probability of occurrence of a string u. Since the process is stationary, this probability does not depend on where in the sequence the string occurs. Note that the frequencies must satisfy a set of consistency conditions, as follows.
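The displayed consistency conditions were lost in the conversion of this document. In the f notation introduced above they are presumably the standard stationarity conditions, roughly of the following form (our reconstruction, not a verbatim quotation):

```latex
\sum_{a \in \{A,C,G,T\}} f(ua) \;=\; f(u) \;=\; \sum_{a \in \{A,C,G,T\}} f(au)
\qquad \text{for every string } u .
```

In words, the (k + 1)-tuple frequencies must marginalize, on either end, to the k-tuple frequencies.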

There is a well-known procedure that perfectly reproduces the specified statistics by modeling the given process as a k-step Markov process. Assuming that the process is a k-step Markov process means that the conditional probability of the next symbol, given the entire past, depends only on the k most recent symbols; the probability is not affected by symbols that lie further back. Moreover, the transition probability of this multistep Markov process is computed as the ratio of the corresponding (k + 1)-tuple frequency to the k-tuple frequency of the context (1). The above model, though it is often called a k-step Markov model, is also a traditional (one-step) Markov model over the larger state space consisting of all k-tuples. Suppose u and v are two such states. Then a transition from u to v is possible only if the last k - 1 symbols of u (read from left to right) are the same as the first k - 1 symbols of v. In this case, the probability of transition from the state u to the state v is given by (1); for all other v, the transition probability equals zero. It is clear that, though the state transition matrix has dimension n^k x n^k, every row contains at most n nonzero entries. Such a k-step Markov model perfectly reproduces the tuple frequencies of all lengths up to k + 1. Given a long string, we can write its probability as a product of such transition probabilities, and in this product all of the factors are of reasonable size. To round out the discussion, suppose that the statistics of the process are not known precisely, but need to be inferred on the basis of observing a set of sample paths. This is exactly the problem we have at hand. In this case, one can still apply (1), but with the actual (but unknown) probabilities replaced by their empirically observed frequencies. Each of these gives an unbiased estimate of the corresponding probability.

B. 4M Algorithm

Here, we introduce the basic idea behind the 4M algorithm and then present the algorithm itself. We begin with a multistep Markov model and then reduce the size of the state space further by using a criterion for determining whether some states are Markovian. The basis for the 4M algorithm is a simple property of Markov processes. Consider a Markov chain evolving over a finite alphabet, and consider the frequency of a triplet of symbols. Clearly, for any process (Markovian or not), the frequency of a triplet equals the frequency of its first two symbols times the conditional probability of the third symbol given the first two. Now, if the process is Markovian, then this conditional probability depends only on the middle symbol and not on the first. Hence, if we fix the middle symbol and examine the matrix of triplet frequencies whose rows are indexed by the first symbol and whose columns are indexed by the third symbol, it will have rank one, because every row is the same conditional distribution scaled by the frequency of the corresponding leading pair. There is nothing special about using only a single symbol in the middle: if the middle string is of any finite length, then, just as above, the matrix of frequencies indexed by the preceding symbol and the following symbol has rank one (2). Conversely, if the corresponding (semi-infinite) matrix has rank one for every middle string, then the process is Markovian. Now suppose we drop the assumption that the process is Markovian, and suppose the matrix has rank one for a particular fixed middle string. An elementary exercise in linear algebra shows that, in such a case, the conditional probability of the next symbol depends only on that middle string; this follows by reversing the above reasoning. Accordingly, let us define a state to be Markovian of a given order if the associated matrix has rank one. The distinction here is that we are now speaking about an individual state being Markovian as opposed to the entire process.
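The rank-one property just described can be checked numerically from empirical tuple frequencies. The sketch below builds, for a candidate middle string, the matrix of (k + 1)-tuple frequencies indexed by the preceding context and the following symbol, and inspects its singular values; the helper function, the variable names, and the toy data are illustrative assumptions of ours, not the paper's code.

```python
import itertools
from collections import Counter

import numpy as np

ALPHABET = "ACGT"

def kmer_freqs(samples, k):
    """Empirical k-tuple frequencies pooled over a set of training sequences."""
    counts = Counter()
    for s in samples:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def state_matrix(freqs_k1, v, k):
    """Matrix F with F[w, a] = f(w v a), where w runs over the length-(k - len(v))
    prefixes and a over the next symbol; rank one means the next symbol depends
    on the suffix v alone."""
    prefixes = ["".join(p) for p in itertools.product(ALPHABET, repeat=k - len(v))]
    return np.array([[freqs_k1.get(w + v + a, 0.0) for a in ALPHABET] for w in prefixes])

freqs6 = kmer_freqs(["ATGGCGTTTAAACGTTAGATGCCC"], k=6)     # toy hexamer frequencies
print(np.linalg.svd(state_matrix(freqs6, "CGT", k=5), compute_uv=False))
```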

The above rank one property and definition can also be extended to multistep Markov processes. Suppose the process is a k-step Markov process, so that, for each fixed context of length k, the conditional distribution of the next symbol is determined (3). Hence, for each fixed context, the associated matrix of frequencies has rank one. Following earlier reasoning, we can define a state to be a Markovian state of order m if the corresponding matrix has rank one. Let us now consolidate all of this discussion and apply it to reducing the size of the state space of a multistep Markov process. Suppose, as before, that we are given a stationary stochastic process assuming values in a finite set. We have already seen that, in order to reproduce perfectly the (k + 1)-tuple frequencies, it suffices to construct a k-step Markov model. Suppose that such a multistep model has indeed been constructed; this model makes use of the (k + 1)-tuple frequencies. Now suppose that, for some integer m < k and some string u of length m, it is the case that the associated matrix has rank one. By elementary linear algebra, this implies that the conditional probability of finding a symbol at a particular location depends only on the preceding m symbols u and not on the symbols that precede u. Hence, if the matrix has rank one, then we can collapse all states of the form wu, for all prefixes w, into a single state. For this reason, we call u a Markovian state if its matrix has rank one. The interpretation of u being a Markovian state is that, when this string occurs, the process has a memory of only m time steps and not k in general. To implement the reduction in state space, we therefore proceed as follows.

Step 1) Compute the vector of (k + 1)-tuple frequencies.

Step 2) Set m = 1 and, for each string u of length m, compute the associated matrix. If it has rank one, then collapse all states of the form wu, for all prefixes w, into a single state. Repeat this test for all strings of length m.

Step 3) Increase the value of m by one and repeat until m = k - 1.

When the search process is complete, the initial set of n^k states will have been collapsed into some intermediate number of states, whose value depends on the (k + 1)-tuple frequencies. Since we are modeling the process as a k-step Markov process, in general, for a string of any length, we can write its probability as a product of conditional probabilities with memory k (4). Now, if a substring of the string is a Markovian state, then we can take advantage of (3) and substitute the corresponding shorter-memory conditional probability in the above formula. This is the reason for calling the algorithm a mixed memory Markov model, since different tuples have memories of different lengths. The preceding theory is exact provided we use true probabilities in the various computations. However, in setting up the multistep Markov model, we are using empirically observed frequencies and not true probabilities. Hence, it is extremely unlikely that any matrix will exactly have rank one. At this point, we take advantage of the fact that we wish to do classification and not modeling. This means that, in constructing the coding model and the non-coding model, it is really not necessary to get the likelihoods exactly right; it is sufficient for them to be of the right order of magnitude. Hence, in implementing the 4M algorithm, we take a matrix as having effective rank one if it satisfies the condition σ2 ≤ ε σ1 (5), where σ1 and σ2 denote the largest two singular values of the matrix and ε is an adjustable threshold parameter. We point out in Section VI that, by using the K-L divergence rate between Markov models, it is possible to choose the threshold ε intelligently.
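Steps 1)-3) together with the effective-rank-one condition (5) translate directly into a reduction loop. The sketch below reuses the hypothetical kmer_freqs() and state_matrix() helpers from the previous sketch; the threshold value and the returned mapping (from each full k-tuple state to the shortest Markovian suffix found for it) are illustrative choices of ours, not the paper's implementation.

```python
import itertools

import numpy as np

ALPHABET = "ACGT"

def effective_rank_one(F, eps):
    """Condition (5): the second-largest singular value is at most eps times the largest."""
    if not F.any():
        return False
    s = np.linalg.svd(F, compute_uv=False)
    return s[1] <= eps * s[0]

def four_m_reduce(freqs_k1, k=5, eps=0.1):
    """Step 1: (k+1)-tuple frequencies are given (freqs_k1).  Steps 2-3: test candidate
    suffixes of increasing length m and collapse every state ending in a Markovian
    suffix; keeping the first (shortest) suffix found mimics the mixed-memory idea."""
    memory = {}                                        # full k-tuple state -> collapsed suffix
    for m in range(1, k):                              # Step 3: increase m up to k - 1
        for v in ("".join(t) for t in itertools.product(ALPHABET, repeat=m)):
            if effective_rank_one(state_matrix(freqs_k1, v, k), eps):
                for w in ("".join(t) for t in itertools.product(ALPHABET, repeat=k - m)):
                    memory.setdefault(w + v, v)        # Step 2: collapse states of the form wv
    return memory
```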
Setting a state to be a Markovian state even if its matrix is not exactly a rank one matrix is equivalent to making an approximation: the full-memory conditional probabilities in question are replaced by the shorter-memory conditional probabilities (6). In other words, in the original k-step Markov model, the entries in the rows corresponding to all states of the form wu are modified according to (6).

III. K-L DIVERGENCE RATE

Here, we introduce the notion of the K-L divergence rate between stochastic processes and its applications to Markov chains. Then, we derive an expression for the K-L divergence rate between the original k-step Markov model and the 4M-reduced model. This formula is of interest because the two processes have a common output space, but not a common state space. These results are applied to some problems in genomics in Section IV.

A. K-L Divergence

Let n be an integer, and let the n-simplex denote the set of probability vectors with n components. Thus, the n-simplex is just the set of probability distributions for an n-valued random variable. Suppose we are given two such

5 30 SPECIAL ISSUE ON SYSTEMS BIOLOGY, JANUARY 2008 probability distributions. Then the K-L divergence between the two vectors is defined as Note that, in order for to be finite, needs to be dominated by, that is,. We write or to denote that is dominated by or that dominates. Here, we adopt the usual convention that. The K-L divergence has several possible interpretations, of which only one is given here. Suppose we are given data generated by an i.i.d. sequence whose one-dimensional marginal distribution is. There are two competing hypotheses, namely, that the probability distribution is and that the probability distribution is, neither of which may be the truth. If we observe a sequence, where is the length of the observation and each has one of the possible values, we compute the likelihood of the observation under each of the two hypotheses and choose the more likely one, that is, the hypothesis that is more compatible with the observed data. In this case, it is easy to show that the expected value of the log-likelihood ratio is precisely equal to. Thus, in the long run, we will choose the hypothesis if and the hypothesis if. In other words, in the long run, we will choose the hypothesis that is closer to the truth. Therefore, even though the K-L divergence is not truly a distance (it does not satisfy either the symmetry property or the triangle inequality), it does induce a partial ordering on the set. The difference is the per-symbol contribution to the log-likelihood ratio. As a parenthetical aside, see [11] for a very general discussion of divergence generated by an arbitrary convex function. For an appropriate choice of the convex function, the corresponding divergence will in fact satisfy the one-sided triangle inequality. However, the popular choice is not one such function. B. K-L Divergence Rate The traditional K-L divergence measure is perfectly fine when the classification problem involves a sequence of independent observations. However, in trying to model a stochastic process via observing it, it is not always natural to assume that the observations are independent. It is therefore desirable to have a generalization of the K-L divergence to the case where the samples may be dependent. Such a generalization is given by the K-L divergence rate. Suppose is some set, and is a stochastic process assuming values in the set. Thus, the stochastic process itself assumes values in the infinite Cartesian product space. Suppose are two probability laws, that is, probability measures on the product space. In principle, we could define the K-L divergence between the two laws by extending the standard definition, using Radon Nikodym derivatives and so on. The trouble is that most of the time the divergence would be infinite and conveys no useful information. Thus, blindly computing the divergence between the two laws of a stochastic process gives no useful information most of the time. (7) To get around this difficulty, it is better to use the K-L divergence rate. It appears that the K-L divergence rate was introduced in [9]. If and are two probability laws on and if is a finite set, we define where and are the marginal distributions on and, respectively, onto the -dimensional product, and is just the conventional K-L divergence (without the rate). The idea is that, in many cases, the pure K-L divergence approaches infinity as. However, dividing by moderates the rate of growth. 
Moreover, if the ratio has a finite limit as, then the K-L divergence rate gives a measure of the asymptotic rate at which the pure divergence blows up as. The K-L divergence rate has essentially the same interpretation as the K-L divergence. Suppose we are observing a stochastic process whose law is. We are trying to decide between two competing hypotheses: The process has the law, and the process has the law. After samples, the expected value of the log-likelihood ratio is asymptotically equal to. The paper [16] gives a good historical overview of the properties of the K-L divergence. Specifically, in general the K-L divergence may not exist between arbitrary probability measures, but it seems to exist under many reasonable conditions. For example, it is known [6] that, if is a stationary law and is the law of a finite-state Markov process, then the K-L divergence rate is well defined. It is shown in [14] that the K-L divergence rate exists if both laws correspond to ergodic processes. C. K-L Divergence Rate Between Markov Processes In [16], an explicit formula is given for the K-L divergence rate between two Markov processes over a common (finite) state space. We give an alternate version of the formula derived in [16], which generalizes very cleanly to multistep Markov processes. Suppose and are the laws of two Markov processes over a finite set. Thus, are stochastic matrices and are corresponding stationary vectors. Thus,.If is the law of the Markov process and is the law of the Markov process, then it is shown in [16] that where denote the th rows of the matrics and, respectively. In order for the divergence rate to be finite, the two state transition matrices and must satisfy the condition or, in the earlier notation, we must have. We denote this condition by or. (8) (9)

Now we give an alternate formulation of (9) that is in some sense a little more intuitive.

Theorem 1: Suppose we are given two stochastic matrices together with associated stationary probability distributions, and consider the laws of the corresponding Markov processes. Let the doublet frequency vector of a chain denote the vector of frequencies of pairs of consecutive states, and let the singlet frequency vector denote its vector of state frequencies. Suppose the first law is dominated by the second. Then, the K-L divergence rate between the two Markov chains is given by the divergence between the doublet frequency vectors minus the divergence between the singlet frequency vectors (10), where the divergence is the conventional K-L divergence between probability vectors.

1) Remarks: Formula (10) gives a nice interpretation of the K-L divergence rate between Markov chains: it is just the difference between the divergence of the doublet frequencies and the divergence of the singlet frequencies. Moreover, it is easy to extend it to k-step Markov chains: the K-L divergence rate is just the difference between the divergence of the (k + 1)-tuple frequencies and the divergence of the k-tuple frequencies. The proof is omitted in the interests of brevity; it can be found in [21]. In [16], the authors do not give an explicit formula for the K-L divergence rate between multistep Markov chains. There is an analogous formula to (10) in the case of k-step Markov models. Define the frequency vectors of (k + 1)-tuples and of k-tuples for each of the two Markov chains in the obvious fashion. Then the divergence rate equals the divergence between the (k + 1)-tuple frequency vectors minus the divergence between the k-tuple frequency vectors (11). The proof, based on the fact that a k-step Markov model is just a one-step Markov model on the state space of k-tuples, is easy and is left to the reader.

D. K-L Divergence Rate When the 4M Algorithm is Used

Theorem 1 gives the K-L divergence rate between two Markov processes over the same state space. In this paper, we begin with a k-step Markov process and approximate it by some other Markov process by applying the 4M algorithm. When we do so, the resulting processes no longer share a common state space. Thus, Theorem 1 no longer applies. The next theorem gives a formula for the K-L divergence rate when the 4M algorithm is used to achieve this reduction. Note that the problem of computing the K-L divergence rate between two entirely arbitrary HMMs with a common output space is still an open problem.

Theorem 2: Suppose we are given a stationary stochastic process whose (k + 1)-tuple frequencies are specified, and let its approximation by a k-step Markov process be constructed as above. Suppose now that we apply the 4M algorithm and choose various tuples, of various lengths, as Markovian states, resulting in a reduced stochastic process. Then, the K-L divergence rate between the original k-step Markov model and the 4M-reduced Markov model (or equivalently, between the laws of the two processes) is given by (12).

Proof: Note that the full k-step Markov model has exactly n nonzero entries in each row, namely, the conditional probabilities of the n possible next symbols given the k-tuple labeling that row. One can think of the reduced-order model obtained by the 4M algorithm as containing the same rows, except that, if u is a Markovian state, then the entries in all rows corresponding to states that end in u are changed from the full-memory conditional probabilities to the conditional probabilities given u alone. The vector of k-tuple frequencies is a stationary distribution of the original k-step Markov model. Now (12) readily follows from (11). Note that, in applying the 4M algorithm, we approximate the full-memory conditional probability by the shorter-memory conditional probability for each string that is deemed to be a Markovian state. Hence, the quantity inside the logarithm in (12) should be quite close to one, and its logarithm should be close to zero.
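Formula (11) makes the divergence rate between two multistep Markov models computable from tuple frequencies alone. The short sketch below is our own illustration under an assumed data layout (dictionaries mapping tuples to frequencies), with base-2 logarithms as stated in Section IV; it is not the authors' code.

```python
import math

def kl_divergence(p, q):
    """Conventional K-L divergence D(p || q), in bits, between two frequency
    dictionaries; assumes q dominates p (q[w] > 0 whenever p[w] > 0)."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0.0)

def kl_rate(p_k1, q_k1, p_k, q_k):
    """Formula (11): the divergence rate between two k-step Markov models is the
    divergence of their (k+1)-tuple frequencies minus that of their k-tuple frequencies."""
    return kl_divergence(p_k1, q_k1) - kl_divergence(p_k, q_k)
```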
IV. COMPUTATIONAL RESULTS I: APPLICATIONS OF THE K-L DIVERGENCE RATE

This section contains the first set of computational results. Here, we study the three-periodicity of coding regions using the K-L divergence rate. The same K-L divergence rate is also used to show that there is virtually no three-periodicity effect in the non-coding regions. Then, we analyze the effect of reducing the size of the state space using the 4M algorithm in terms of the generalization error.

A. List of Organisms Analyzed

The 4M algorithm was applied to 75 prokaryote genomes of microbial organisms. These genomes comprised both bacteria as well as archaea. To save space, in the tables showing the computational results we give the names of the various organisms in a highly abbreviated form. Table V gives a list of all the organisms for which the computational results are presented here, together with the abbreviations used.

B. Three-Periodicity of Coding Regions

There is an important feature that needs to be built into any stochastic model of genome sequences. It has been observed that, if one were to treat the coding regions as sample paths of a stationary Markov process, then the results are pretty poor. The reason is that genomic sequences exhibit a pronounced three-periodicity. This means that the conditional probability of the next symbol given the preceding symbols is not independent of the position within the gene, but is instead periodic with a period of three. Thus, instead of constructing one k-step Markov model, we must in fact construct three such models. These are referred to as the Frame 0, Frame 1, and Frame 2 models. We begin from the start of the genome and label the first nucleotide as Frame 0, the second nucleotide as Frame 1, the third

nucleotide as Frame 2, and loop back to label the fourth nucleotide as Frame 0, and so on. Then the 4M reduction using the rank condition is applied to each frame. Since a three-periodic Markov chain over a given state space can also be written as a stationary Markov chain over a suitably enlarged state space, there are no conceptual difficulties because of three-periodicity.

TABLE I: DIVERGENCES BETWEEN MARKOV MODELS OF CODING REGIONS

TABLE II: DIVERGENCES BETWEEN MARKOV MODELS OF NON-CODING REGIONS

In this subsection, we use the K-L divergence rate introduced in Section III to assess the significance of three-periodicity in both coding and non-coding regions in various organisms. The study is carried out as follows. In 13 organisms, we constructed three-periodic models for the known coding regions as well as known non-coding regions, using the value k = 5. This means that for each organism we constructed six different fifth-order Markov models that perfectly reproduced the observed hexamer frequencies (three coding-region models and three non-coding-region models, one per frame). For the three coding-region models, we computed six different divergence rates, namely, the rates between all ordered pairs of distinct frames; then, we did the same for the three non-coding-region models. (Remember that the K-L divergence rate is not symmetric.) Tables I and II show these six divergence rates for 13 of the organisms listed in Section IV-A, for both coding regions as well as non-coding regions. Actually, we computed such divergences for all 75 organisms, but only 13 are presented here. To make the tables fit within the two-column format, we use obvious abbreviations for the model names. Note that throughout the base of the logarithm used in (12) is 2; thus, all logarithms are binary logarithms. From Tables I and II, it is clear that the three-periodicity effect in the non-coding regions is noticeably less than in the coding regions, in the sense that the K-L divergence rates between the three frames of the non-coding models are essentially negligible, compared with the corresponding divergence rates in the coding regions. In fact, except for Mycoplasma genitalium, the divergences in the non-coding regions are essentially negligible. M. genitalium is a peculiar organism, in which the codon TGA codes for the amino acid tryptophan, instead of being a stop codon as it is in practically all other organisms. Thus, when we construct algorithms for predicting genes, we would be justified in ignoring the three-periodicity effect in the non-coding regions.

TABLE III: SIZES OF 4M-REDUCED MARKOV MODELS

C. Reduction in Size of State Space

Here, we study the reduction in the size of the state space when the 4M algorithm is used. In applying the 4M algorithm, we used the value k = 5. It has been verified numerically that smaller values of k do not give good predictions, while larger values of k do not lead to any improvement in performance. Thus, k = 5 seems to be the right value. Hence, for each organism, we constructed three coding-region models and one non-coding-region model, each model being fifth-order Markovian. Recall that, due to the three-periodicity of the coding regions, we need three models, one for each frame in the coding region. Since the non-coding region does not show so much of a three-periodicity effect (as demonstrated in the preceding subsection), we ignore that possibility and construct just one model for the non-coding region.
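A small sketch of the frame bookkeeping described above: each position of a coding sequence is assigned to Frame 0, 1, or 2 by its index modulo three, and the training tuples are pooled per frame so that three separate fifth-order models can be estimated. The convention used here (the frame of a tuple is that of its final symbol) is our assumption for illustration.

```python
def tuples_by_frame(coding_seq, k=5):
    """Split the (k+1)-tuples of a coding sequence into three bins according to the
    frame (index modulo 3) of each tuple's final symbol; a separate k-step model is
    then trained from each bin."""
    frames = {0: [], 1: [], 2: []}
    for i in range(k, len(coding_seq)):
        frames[i % 3].append(coding_seq[i - k:i + 1])
    return frames
```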
Each of the fifth-order Markovian models has 1024 states, consisting of pentamers of nucleotides. Then, for each of these four models, we applied the 4M reduction, with the threshold ε in (5) set somewhat arbitrarily; a more systematic way to choose ε is given in Section VI. Recall that the larger the threshold, the larger the number of states that will satisfy the near rank one condition (5), and the greater the reduction in the size of the state space. The CPU time for computing the hexamer frequencies of a genome with about one million base pairs is approximately 10 s on an Intel Pentium IV processor running at 2.8 GHz, while the state space reduction takes just 0.3 s, or about 3% of the time

needed to compute the frequencies. Thus, once the hexamer frequencies are constructed, the extra effort needed to apply the 4M algorithm is negligible. Table III shows the size of the 4M-reduced state space for each of the 13 organisms studied. All of the numbers in the table should be compared with 1024, which is the number of states of the full fifth-order Markov chain. Moreover, in Glimmer and its variants, one uses up to eighth-order Markov models for certain organisms, meaning that in the worst case the size of the state space could be as high as 4^8 = 65,536. From this table, it is clear that in most cases the 4M algorithm leads to a fairly significant reduction in the size of the state spaces. There are some dramatic reductions, such as in the case of B. sub, for which the reduction in the size of the state space is of the order of 85%. Moreover, in almost all cases, the size of the state space is reduced by at least 50%.

TABLE IV: RESULTS OF WHOLE GENOME ANNOTATION USING THE 4M ALGORITHM USING 50% TRAINING DATA

TABLE V: ABBREVIATIONS OF ORGANISM NAMES AND COMPARISON OF 4M VERSUS GLIMMER3

V. COMPUTATIONAL RESULTS II: GENE PREDICTION

Now we come to the main topic of this paper, namely, finding genes. One can identify two distinct philosophies in gene-prediction algorithms. Some algorithms, including the one presented here, can be described as bootstrapping. Thus, we begin with some known genes, construct a stochastic model based on those, and then use that model to classify the remaining ORFs as potential genes. The most promising predictions are then validated either through experiment or through comparison with known genes of other similar organisms. The validated genes are added to the training sample and the process is repeated. This is why the process may be called bootstrapping. In contrast, Glimmer (in its several variants), which is among the most popular and most accurate prediction algorithms at present, can be described as an ab initio scheme. In Glimmer, all ORFs longer than 500 base pairs are used as the training set for the coding regions. The premise is that almost all of these are likely to be genes anyway. In principle, we could apply the 4M algorithm with the same initial training set and the results would not be too different from those presented here. For a genome with one million base pairs, the 4M algorithm required approximately 10 s of CPU time and approximately 5 Mb of storage for training the coding and non-coding models, compared with 10 s of CPU time and 50 Mb of storage for Glimmer3. The prediction problem took about 60 s of CPU time and 20 Mb of storage for 4M versus 13 s of CPU time and 4 Mb of storage for Glimmer3. Our implementation of the 4M algorithm was done using Python, which is very efficient for the

programmer but very inefficient in terms of CPU time. Thus, we believe that there is considerable scope for reduction in both the CPU time as well as the storage requirements when implementing the 4M algorithm.

TABLE VI: COMPARISON OF 4M ALGORITHM VERSUS GENEMARK2.5D AND GENEMARKHMM2.6g

A. Classification of Annotated Genes Using 4M and Other Methods: Comparison of Results

Here, we take the database of annotated genes for each of 75 organisms and classify them using 4M, Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. The database of annotated genes represents a kind of consensus. Some of the genes in the database are experimentally validated, while others are sufficiently similar at a symbol-for-symbol level to other known genes that these too are believed to be genes. Thus, it is essential that any algorithm should pick up most if not all of these annotated genes. The test was conducted as follows. To construct the coding and non-coding models for the 4M algorithm, we took some known genes to train the coding model and some known non-coding regions to train the non-coding model. The fraction of known genes used to train the coding model was 50%, that is, we used every other gene. For the non-coding model, we picked around 20% of the known non-coding regions at random. Throughout, we used a three-periodic model for the coding regions and a uniform (i.e., nonperiodic) model for the non-coding regions. These models were then 4M-reduced using the threshold condition (5). Then the remaining (known) coding and non-coding regions were classified using the log-likelihood method. In the tables, we have used the following notation: Total Genes denotes the total number of genes in the annotated database; one column gives the genes picked up by both the 4M algorithm and Glimmer3; another gives the genes missed by both algorithms; another gives the genes picked up by the 4M algorithm but missed by Glimmer3; and another gives the genes missed by the 4M algorithm but picked up by Glimmer3. Similar notation is used in Table VI, with Glimmer3 replaced by GeneMark2.5d (denoted by GMK) and GeneMarkHMM (denoted by GHMM). First, we compare the performance of the 4M algorithm against that of Glimmer3, as detailed in Table V. It is worth pointing out that, in the results presented here, there is no postprocessing of the raw output of the 4M algorithm, as is common with other algorithms. The key points of comparison are the numbers in the next-to-last and last columns. From this table, it can be seen that, except in the case of seven organisms (B. jap, C. vio, G. vio, the three organisms of the Pseudomonas family, and R. etl), 4M finds at least as many genes as it misses, compared with Glimmer3. On the other side, in 32 organisms, the number of annotated genes found by 4M and missed by Glimmer3 is more than double the number missed by 4M and found by Glimmer3. Next, a glance at Table VI reveals that 4M overwhelmingly outperforms GeneMark2.5d, which is an older algorithm based on modeling the genes using a fifth-order Markov model. Since 4M is also based on a fifth-order Markov model, but with some reduction in the size of the state space, the vastly superior performance of 4M is intriguing to say the least. Compared with GeneMarkHMM2.6g, the superiority of 4M is not so pronounced; nevertheless, 4M has the better performance. To summarize, 4M somewhat outperforms Glimmer3 and GeneMarkHMM2.6g in most cases, and considerably outperforms GeneMark2.5d. Finally, we compared the performance of the 4M algorithm with Glimmer3 on short genes. It is widely accepted that long genes are easy to find using just about any algorithm and that the real test of an algorithm is its ability to find short genes.

TABLE VII: COMPARISON OF 4M ALGORITHM VERSUS GLIMMER3 ON SHORT GENES
Since 4M significantly outperforms both versions of GeneMark in any case, we present only the comparison of 4M against Glimmer3 on three sets of genes: those of length less than 150 base pairs, between 151 and 300 base pairs, and between 301 and 500 base pairs. In presenting the results, we omitted any organism where the number of ultrashort genes of length less than 150 base pairs was less than 20. These results are found in Table VII. From this table, it is clear that 4M vastly outperforms Glimmer3 in predicting ultrashort genes and is somewhat superior in finding short genes. In the case of the organism M. genitalium, which has the exceptional property that the codon TGA codes for the amino acid tryptophan instead of being a stop codon, the 4M algorithm performs poorly when this fact is not incorporated. However, when this fact is incorporated, the performance of 4M improves dramatically. This is why there are two rows corresponding to M. genitalium: the first row assumes that TGA is a stop codon, and the second assumes that TGA is not a stop codon. We were at first rather startled by the extremely poor performance

of the 4M algorithm in the case of M. genitalium, considering that the algorithm performed so well on the rest of the organisms. This caused us to investigate the organism further and led us to discover from the literature that, in fact, M. genitalium has a nonstandard genetic code. The moral of the story is that, by purely statistical analysis, we could find out that there was something unusual about this organism. More interestingly, even in the case of the extremely well-studied organism E. coli, neither the 4M algorithm nor Glimmer3 performs particularly well. This kind of poor performance is usually indicative of some nonstandard behavior on the part of the organism, as in the case of M. genitalium. This issue needs to be studied further.

B. Whole Genome Annotation Using the 4M Algorithm

Here, we carry out whole genome annotation of 13 organisms using the 4M algorithm. First, we identify all of the ORFs in the entire genome. Then, we train the coding model using every other gene in the database of annotated genes and the non-coding model using about 20% of the known non-coding regions. Both models are 4M-reduced. Then, the entire set of ORFs is classified using the log-likelihood classification scheme. The legend for the column headings in Table IV is as follows: the organism, the number of ORFs, the number of annotated genes, the number of annotated genes that are picked up by 4M and predicted to be genes, the number of other ORFs whose current status is unknown, and, finally, the number of these other ORFs that are predicted by 4M to be genes. To save space, we present results for only 13 organisms. Now let us discuss the results in Table IV. It is reasonable to expect that, since half of the annotated genes are used as the training set, the other half of the annotated genes are picked up by 4M as being genes. However, what is surprising is how few of the other ORFs whose status is unknown are predicted to be genes. Actually, for each organism, the 4M algorithm predicts several hundred ORFs to be additional genes. However, since there are many overlaps amongst these ORFs, we eliminate the overlaps to predict a single gene for each set of overlapping regions. Thus, there is a clear differentiation between the statistical properties of the annotated genes and the ORFs whose status is unknown, and the 4M algorithm is able to differentiate between them. These additional predicted genes are good candidates for experimental verification. In order to prioritize them, they can be ranked in terms of the normalized log-likelihood ratio, that is, the raw log-likelihood ratio divided by the length of the ORF. The reason for normalizing the log-likelihood ratio is that, as the length becomes large, the raw log-likelihood ratio will also become large. Thus, comparing the raw log-likelihood ratios of two ORFs is not meaningful. However, comparing the normalized log-likelihood ratios allows us to identify the predictions about which we are most confident. This is the advantage of having a stochastic modeling methodology whose significance is easy to analyze.
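The length-normalized ranking just described is simple to implement. In the sketch below, the two scoring callables stand in for the trained coding and non-coding models; the names are ours and the snippet is illustrative rather than the paper's implementation.

```python
def rank_predicted_genes(orfs, loglik_coding, loglik_noncoding):
    """Sort candidate ORFs by the normalized log-likelihood ratio, i.e. the raw
    log-likelihood ratio divided by the ORF length, most confident predictions first."""
    scored = [((loglik_coding(o) - loglik_noncoding(o)) / len(o), o) for o in orfs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```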
VI. CONCLUSION AND FUTURE WORK

We have studied the problem of finding genes from prokaryotic genomes using stochastic modeling techniques. The gene-finding problem is formulated as one of classifying a given sequence of bases using two distinct Markovian models for the coding regions and the non-coding regions, respectively. For the coding regions, we construct a three-periodic fifth-order Markovian model, whereas for the non-coding regions we construct a fifth-order Markovian model (ignoring the three-periodicity effect). Then, we introduced a new method known as 4M that allows us to assign variable-length memories to each symbol, thus permitting a substantial reduction in the size of the state space of the Markovian model. The disparities between various models have been quantified using the K-L divergence rate between Markov processes. The K-L divergence rate has a number of useful applications, some of which are brought out in this paper. For instance, using this measure, it has been conclusively demonstrated that the three-periodicity effect is much more pronounced in coding regions than in non-coding regions. This is why we could ignore three-periodicity in non-coding regions. An explicit formula has been given for the K-L divergence rate between a fifth-order Markov model and the 4M-reduced model. This formula allows us to quantify the classification error resulting from this model order reduction. Using this new algorithm, we annotated 75 different microbial genomes from several classes of bacteria and archaea. The performance of the 4M algorithm was then compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It has been shown that the 4M algorithm somewhat outperforms Glimmer3 and considerably outperforms both versions of GeneMark. When it comes to finding ultrashort and short genes, the 4M algorithm significantly outperforms even Glimmer3. We also carried out whole genome annotations of all ORFs in several organisms using the 4M algorithm. We found that, while the 4M algorithm detects an overwhelming majority of annotated genes as genes, it picks up a surprisingly small fraction of the remaining ORFs as genes. Thus, the 4M algorithm is able to differentiate very clearly between the known genes and the unknown ORFs. Moreover, since the 4M algorithm uses a simple log-likelihood test, it is possible to rank all the predicted genes in terms of decreasing log-likelihood ratio. In this way, the most confident predictions can be tried out first. Formula (12) can be used to choose the threshold ε in (5) in an adaptive manner. Consider the laws of the full fifth-order Markov models for the coding regions and the non-coding regions, respectively, and the law of the 4M-reduced model obtained using a threshold ε. We should choose ε to be as large as possible while maintaining the constraint that the resulting K-L divergence rate stays below a new adjustable parameter. If we choose a very small value of this parameter, then the log-likelihood ratio between the coding and non-coding models will hardly be affected, if

12 VIDYASAGAR et al.: THE 4M (MIXED MEMORY MARKOV MODEL) ALGORITHM 37 is replaced by. This will be a more intelligent and adaptive way to choose. ACKNOWLEDGMENT The authors would like to thank B. Nittala and M. Haque for assisting with the interpretation of some of the computational results. REFERENCES [1] P. Baldi and S. Brünak, Bioinformatics: A Machine Learning Approach. Cambridge, MA: MIT Press, [2] C. Burge and S. Karlin, Prediction of complete gene structures in human genomic DNA, J. Molec. Biol., vol. 268, pp , [3] C. Burge and S. Karlin, Finding genes in genomic DNA, Curr. Opin. Struct. Biol., vol. 8, pp , [4] A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg, Improved microbial gene identification with GLIMMER, Nucleic Acids Res., vol. 27, no. 23, pp , [5] W. J. Ewens and G. R. Grant, Statistical Methods in Bioinformatics, 2nd ed. New York: Springer-Verlag, [6] R. M. Gray, Entropy and Information Theory. New York: Springer- Verlag, [7] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, [8] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, [9] B.-H. Juang and L. R. Rabiner, A probabilistic distance measure for hidden Markov models, AT&T Tech. J., vol. 64, no. 2, pp , Feb [10] A. Krogh, I. S. Mian, and D. Haussler, A hidden Markov model that finds genes in E. coli DNA, Nucleic Acids Res., vol. 22, no. 22, pp , [11] F. Liese and L. Vajda, On divergences and informations in statistics and information theory, IEEE Trans. Inf. Theory, vol. 52, no. 10, pp , Oct [12] A. V. Lukashin and M. Borodovsky, GeneMark.hmm: New solutions for gene finding, Nucleic Acids Res., vol. 26, no. 4, pp. 11 7, [13] W. H. Majoros and S. L. Salzberg, An empirical analysis of training protocols for probabilistic gene finders, BMC Bioinformat., vol.???, p. 206, [14] K. Marton and P. C. Shields, The positive-divergence and blowing up properties, Israel J. Math, vol. 86, pp , [15] L. W. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, vol. 77, no. 2, pp , Feb [16] Z. Rached, F. Alalaji, and L. L. Campbell, The Kullback-Leibler divergence rate between Markov sources, IEEE Trans. Inf. Theory, vol. 50, no. 5, pp , May [17] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, Microbial gene identification using interpolated Markov models, Nucleic Acids Res., vol. 26, no. 2, pp , [18] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. bhattacharya, and R. Ramaswamy, Prediction of probable genes by Fourier analysis of gene sequences, Computat. Appl. Biosci., vol. 13, no. 3, pp , Jun [19] J. C. Venter et al., The sequence of the human genome, Science, vol. 291, pp , [20] M. Vidyasagar, A realization theory for hidden Markov models: The partial realization problem, in Proc. Symp. Math. Theory Netw. Syst., Kyoto, Japan, Jul. 2006, pp [21] M. Vidyasagar, Bounds on the Kullback-Leibler divergence rate between hidden Markov models, in Proc. IEEE Conf. Decision Control, Dec , Mathukumalli Vidyasagar (F 83) was born in Guntur, India, on September 29, He received the B.S., M.S., and Ph.D. degrees from the University of Wisconsin, Madison, in 1965, 1967, and 1969, respectively, all in electrical engineering. Between 1969 and 1989, he was a Professor of Electrical Engineering with various universities in the United States and Canada. 
His last overseas job was with the University of Waterloo, Waterloo, ON, Canada from 1980 to In 1989, he returned to India as the Director of the newly-created Centre for Artificial Intelligence and Robotics (CAIR) and built up CAIR into a leading research laboratory of about 40 scientists working on aircraft control, robotics, neural networks, and image processing. In 2000, he joined Tata Consultancy Services (TCS), Hyderabad, India, India s largest IT firm, as an Executive Vice President in charge of Advanced Technology. In this capacity, he created the Advanced Technology Centre (ATC), which currently consists of about 80 engineers and scientists working on e-security, advanced encryption methods, bioinformatics, Open Source/Linux, and smart-card technologies. He is the author or coauthor of nine books and more than 130 papers in archival journals. Dr. Vidyasagar is a Fellow of the Indian Academy of Sciences, the Indian National Science Academy, the Indian National Academy of Engineering, and the Third World Academy of Sciences. He was the recipient of several honors in recognition of his research activities, including the Distinguished Service Citation from the University of Wisconsin at Madison, the 2000 IEEE Hendrik W. Bode Lecture Prize, and the 2008 IEEE Control Systems Award. Sharmila S. Mande received the Ph.D. degree in physics from the Indian Institute of Science, Bangalore, India, in Her research interests include genome informatics, protein crystallography, protein modeling, protein protein interaction, and comparative genomics. She performed research work with the University of Groningen, The Netherlands, University of Washington, Seattle, the Institute of Microbial Technology, Chandigarh, India, and the Post Graduate Institute of Medical Education and Research, Chandigarh, before joining Tata Consultancy Services, Hyderabad, India, in 2001 as head of the Bio-Sciences Division, which is part of TCS Innovation Lab. Ch. V. Siva Kumar Reddy received the M.Tech. degree in computer science and engineering from the University of Hyderabad, Hyderabad, India. He is currently with Tata Consultancy Services (TCS) s Bio-Sciences Division, which is part of the TCS Innovation Lab, Hyderabad. His research interests include computational methods in gene prediction. V. Raja Rao received the M.S. degree in computer science and engineering from the Indian Institute of Technology, Mumbai, India, in He then joined the Bioinformatics Division, Tata Consultancy Services (TCS), Hyderabad, India, and was involved in the development of various bioinformatics products. His research interests include computational methods in gene prediction and parallel computing for bioinformatics. He is currently consulting for TCS at Sequenom Inc., San Diego, CA.


More information

CLASSICAL error control codes have been designed

CLASSICAL error control codes have been designed IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 56, NO 3, MARCH 2010 979 Optimal, Systematic, q-ary Codes Correcting All Asymmetric and Symmetric Errors of Limited Magnitude Noha Elarief and Bella Bose, Fellow,

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

STRUCTURAL BIOINFORMATICS I. Fall 2015

STRUCTURAL BIOINFORMATICS I. Fall 2015 STRUCTURAL BIOINFORMATICS I Fall 2015 Info Course Number - Classification: Biology 5411 Class Schedule: Monday 5:30-7:50 PM, SERC Room 456 (4 th floor) Instructors: Vincenzo Carnevale - SERC, Room 704C;

More information

The Integers. Peter J. Kahn

The Integers. Peter J. Kahn Math 3040: Spring 2009 The Integers Peter J. Kahn Contents 1. The Basic Construction 1 2. Adding integers 6 3. Ordering integers 16 4. Multiplying integers 18 Before we begin the mathematics of this section,

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces

A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001 411 A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces Paul M. Baggenstoss, Member, IEEE

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition R. G. Gallager January 31, 2011 i ii Preface These notes are a draft of a major rewrite of a text [9] of the same name. The notes and the text are outgrowths

More information

Markov Models & DNA Sequence Evolution

Markov Models & DNA Sequence Evolution 7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under

More information

objective functions...

objective functions... objective functions... COFFEE (Notredame et al. 1998) measures column by column similarity between pairwise and multiple sequence alignments assumes that the pairwise alignments are optimal assumes a set

More information

THE HEAVY-TRAFFIC BOTTLENECK PHENOMENON IN OPEN QUEUEING NETWORKS. S. Suresh and W. Whitt AT&T Bell Laboratories Murray Hill, New Jersey 07974

THE HEAVY-TRAFFIC BOTTLENECK PHENOMENON IN OPEN QUEUEING NETWORKS. S. Suresh and W. Whitt AT&T Bell Laboratories Murray Hill, New Jersey 07974 THE HEAVY-TRAFFIC BOTTLENECK PHENOMENON IN OPEN QUEUEING NETWORKS by S. Suresh and W. Whitt AT&T Bell Laboratories Murray Hill, New Jersey 07974 ABSTRACT This note describes a simulation experiment involving

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1. Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer

More information

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Interpolated Markov Models for Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Interpolated Markov Models for Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following the

More information

Elementary Linear Algebra, Second Edition, by Spence, Insel, and Friedberg. ISBN Pearson Education, Inc., Upper Saddle River, NJ.

Elementary Linear Algebra, Second Edition, by Spence, Insel, and Friedberg. ISBN Pearson Education, Inc., Upper Saddle River, NJ. 2008 Pearson Education, Inc., Upper Saddle River, NJ. All rights reserved. APPENDIX: Mathematical Proof There are many mathematical statements whose truth is not obvious. For example, the French mathematician

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Hemashree Bordoloi and Kandarpa Kumar Sarma Abstract. Protein secondary structure prediction is the method of extracting

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Radiological Control Technician Training Fundamental Academic Training Study Guide Phase I

Radiological Control Technician Training Fundamental Academic Training Study Guide Phase I Module 1.01 Basic Mathematics and Algebra Part 4 of 9 Radiological Control Technician Training Fundamental Academic Training Phase I Coordinated and Conducted for the Office of Health, Safety and Security

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

On Objectivity and Models for Measuring. G. Rasch. Lecture notes edited by Jon Stene.

On Objectivity and Models for Measuring. G. Rasch. Lecture notes edited by Jon Stene. On Objectivity and Models for Measuring By G. Rasch Lecture notes edited by Jon Stene. On Objectivity and Models for Measuring By G. Rasch Lectures notes edited by Jon Stene. 1. The Basic Problem. Among

More information

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Daisuke Ikeda Department of Informatics, Kyushu University 744 Moto-oka, Fukuoka 819-0395, Japan. daisuke@inf.kyushu-u.ac.jp

More information

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations Introductory Remarks This section of the course introduces dynamic systems;

More information

The Integers. Math 3040: Spring Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers Multiplying integers 12

The Integers. Math 3040: Spring Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers Multiplying integers 12 Math 3040: Spring 2011 The Integers Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers 11 4. Multiplying integers 12 Before we begin the mathematics of this section, it is worth

More information

Biological Systems: Open Access

Biological Systems: Open Access Biological Systems: Open Access Biological Systems: Open Access Liu and Zheng, 2016, 5:1 http://dx.doi.org/10.4172/2329-6577.1000153 ISSN: 2329-6577 Research Article ariant Maps to Identify Coding and

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

THE information capacity is one of the most important

THE information capacity is one of the most important 256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,

More information

WAITING-TIME DISTRIBUTION FOR THE r th OCCURRENCE OF A COMPOUND PATTERN IN HIGHER-ORDER MARKOVIAN SEQUENCES

WAITING-TIME DISTRIBUTION FOR THE r th OCCURRENCE OF A COMPOUND PATTERN IN HIGHER-ORDER MARKOVIAN SEQUENCES WAITING-TIME DISTRIBUTION FOR THE r th OCCURRENCE OF A COMPOUND PATTERN IN HIGHER-ORDER MARKOVIAN SEQUENCES Donald E. K. Martin 1 and John A. D. Aston 2 1 Mathematics Department, Howard University, Washington,

More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. X, MONTH 2007 1 A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir Abstract We present a simple algorithm

More information

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Hidden Markov Models and Their Applications in Biological Sequence Analysis Hidden Markov Models and Their Applications in Biological Sequence Analysis Byung-Jun Yoon Dept. of Electrical & Computer Engineering Texas A&M University, College Station, TX 77843-3128, USA Abstract

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

4452 Mathematical Modeling Lecture 16: Markov Processes

4452 Mathematical Modeling Lecture 16: Markov Processes Math Modeling Lecture 16: Markov Processes Page 1 4452 Mathematical Modeling Lecture 16: Markov Processes Introduction A stochastic model is one in which random effects are incorporated into the model.

More information

Stochastic Histories. Chapter Introduction

Stochastic Histories. Chapter Introduction Chapter 8 Stochastic Histories 8.1 Introduction Despite the fact that classical mechanics employs deterministic dynamical laws, random dynamical processes often arise in classical physics, as well as in

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

ECE521 Lecture 19 HMM cont. Inference in HMM

ECE521 Lecture 19 HMM cont. Inference in HMM ECE521 Lecture 19 HMM cont. Inference in HMM Outline Hidden Markov models Model definitions and notations Inference in HMMs Learning in HMMs 2 Formally, a hidden Markov model defines a generative process

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

Halting and Equivalence of Program Schemes in Models of Arbitrary Theories

Halting and Equivalence of Program Schemes in Models of Arbitrary Theories Halting and Equivalence of Program Schemes in Models of Arbitrary Theories Dexter Kozen Cornell University, Ithaca, New York 14853-7501, USA, kozen@cs.cornell.edu, http://www.cs.cornell.edu/~kozen In Honor

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

IN this paper, we consider the capacity of sticky channels, a

IN this paper, we consider the capacity of sticky channels, a 72 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 1, JANUARY 2008 Capacity Bounds for Sticky Channels Michael Mitzenmacher, Member, IEEE Abstract The capacity of sticky channels, a subclass of insertion

More information

WE study the capacity of peak-power limited, single-antenna,

WE study the capacity of peak-power limited, single-antenna, 1158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 3, MARCH 2010 Gaussian Fading Is the Worst Fading Tobias Koch, Member, IEEE, and Amos Lapidoth, Fellow, IEEE Abstract The capacity of peak-power

More information

Incompatibility Paradoxes

Incompatibility Paradoxes Chapter 22 Incompatibility Paradoxes 22.1 Simultaneous Values There is never any difficulty in supposing that a classical mechanical system possesses, at a particular instant of time, precise values of

More information

1 Introduction. Exploring a New Method for Classification with Local Time Dependence. Blakeley B. McShane 1

1 Introduction. Exploring a New Method for Classification with Local Time Dependence. Blakeley B. McShane 1 Exploring a New Method for Classification with Local Time Dependence Blakeley B. McShane 1 Abstract We have developed a sophisticated new statistical methodology which allows machine learning methods to

More information

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape UG3 Computability and Intractability (2009-2010): Note 4 4. Bells and whistles. In defining a formal model of computation we inevitably make a number of essentially arbitrary design decisions. These decisions

More information

Information Theory. Week 4 Compressing streams. Iain Murray,

Information Theory. Week 4 Compressing streams. Iain Murray, Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 4 Compressing streams Iain Murray, 2014 School of Informatics, University of Edinburgh Jensen s inequality For convex functions: E[f(x)]

More information

Computing Consecutive-Type Reliabilities Non-Recursively

Computing Consecutive-Type Reliabilities Non-Recursively IEEE TRANSACTIONS ON RELIABILITY, VOL. 52, NO. 3, SEPTEMBER 2003 367 Computing Consecutive-Type Reliabilities Non-Recursively Galit Shmueli Abstract The reliability of consecutive-type systems has been

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

Introduction to Hidden Markov Models (HMMs)

Introduction to Hidden Markov Models (HMMs) Introduction to Hidden Markov Models (HMMs) But first, some probability and statistics background Important Topics 1.! Random Variables and Probability 2.! Probability Distributions 3.! Parameter Estimation

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington

More information

Gene Prediction in Eukaryotes Using Machine Learning

Gene Prediction in Eukaryotes Using Machine Learning Aarhus University Master s Thesis Gene Prediction in Eukaryotes Using Machine Learning Author Lukasz Telezynski Supervisor Christian Storm Pedersen June 15, 2018 ii Abstract Gene prediction, especially

More information

arxiv:cs/ v2 [cs.it] 1 Oct 2006

arxiv:cs/ v2 [cs.it] 1 Oct 2006 A General Computation Rule for Lossy Summaries/Messages with Examples from Equalization Junli Hu, Hans-Andrea Loeliger, Justin Dauwels, and Frank Kschischang arxiv:cs/060707v [cs.it] 1 Oct 006 Abstract

More information

Hardy s Paradox. Chapter Introduction

Hardy s Paradox. Chapter Introduction Chapter 25 Hardy s Paradox 25.1 Introduction Hardy s paradox resembles the Bohm version of the Einstein-Podolsky-Rosen paradox, discussed in Chs. 23 and 24, in that it involves two correlated particles,

More information

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory 1 The intuitive meaning of entropy Modern information theory was born in Shannon s 1948 paper A Mathematical Theory of

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

CHAPTER 2 INTRODUCTION TO CLASSICAL PROPOSITIONAL LOGIC

CHAPTER 2 INTRODUCTION TO CLASSICAL PROPOSITIONAL LOGIC CHAPTER 2 INTRODUCTION TO CLASSICAL PROPOSITIONAL LOGIC 1 Motivation and History The origins of the classical propositional logic, classical propositional calculus, as it was, and still often is called,

More information

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM DMITRY SHEMETOV UNIVERSITY OF CALIFORNIA AT DAVIS Department of Mathematics dshemetov@ucdavis.edu Abstract. We perform a preliminary comparative

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

Commentary. Regression toward the mean: a fresh look at an old story

Commentary. Regression toward the mean: a fresh look at an old story Regression toward the mean: a fresh look at an old story Back in time, when I took a statistics course from Professor G., I encountered regression toward the mean for the first time. a I did not understand

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Information in Biology

Information in Biology Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living

More information

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras Lecture - 13 Conditional Convergence Now, there are a few things that are remaining in the discussion

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Lecture notes on Turing machines

Lecture notes on Turing machines Lecture notes on Turing machines Ivano Ciardelli 1 Introduction Turing machines, introduced by Alan Turing in 1936, are one of the earliest and perhaps the best known model of computation. The importance

More information

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,

More information

Information in Biology

Information in Biology Lecture 3: Information in Biology Tsvi Tlusty, tsvi@unist.ac.kr Living information is carried by molecular channels Living systems I. Self-replicating information processors Environment II. III. Evolve

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information