The 4M (Mixed Memory Markov Model) Algorithm for Finding Genes in Prokaryotic Genomes

Mathukumalli Vidyasagar, Fellow, IEEE, Sharmila S. Mande, Ch. V. Siva Kumar Reddy, and V. V. Raja Rao

Abstract—In this paper, we present a new algorithm called 4M (mixed memory Markov model) for finding genes from the genomes of prokaryotes. This is achieved by modeling the known coding regions of the genome as a set of sample paths of a multistep Markov chain (the coding model) and the known non-coding regions as a set of sample paths of another multistep Markov chain (the non-coding model). The new feature of the 4M algorithm is that different states are allowed to have different memory lengths, in contrast to a fixed multistep Markov model used in GeneMark in its various versions. At the same time, compared with an algorithm like Glimmer3 that uses an interpolation of Markov models of different memory lengths, the statistical significance of the conclusions drawn from the 4M algorithm is quite easy to quantify. Thus, when a whole genome annotation is carried out and several new genes are predicted, it is extremely easy to rank these predictions in terms of the confidence one has in the predictions. The basis of the 4M algorithm is a simple rank condition satisfied by the matrix of frequencies associated with a Markov chain. The 4M algorithm is validated by applying it to 75 organisms belonging to practically all known families of bacteria and archaea. The performance of the 4M algorithm is compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It is found that, in a vast majority of cases, the 4M algorithm finds many more genes than it misses, compared with any of the other three algorithms. Next, the 4M algorithm is used to carry out whole genome annotation of 13 organisms by using 50% of the known genes as the training input for the coding model and 20% of the known non-genes as the training input for the non-coding model. After this, all of the open reading frames are classified. It is found that the 4M algorithm is highly specific in that it picks out virtually all of the known genes, while predicting that only a small number of the open reading frames whose status is unknown are genes.

Index Terms—Algorithm, gene prediction, K-L divergence, Markov model, prokaryotes.

(Manuscript received January 22, 2007; revised September 1. The authors are with the Advanced Technology Centre, Tata Consultancy Services, Software Units Layout, Madhapur, Hyderabad, India; e-mail: sagar@atc.tcs.com; sharmila@atc.tcs.com; siva@atc.tcs.com; raja@atc.tcs.com.)

I. INTRODUCTION

A. Gene-Finding Problem

ALL LIVING things consist of DNA, which is a very complex molecule arranged in a double helix. DNA consists of a series of nucleotides, where each nucleotide is denoted by the base it contains, namely, A (Adenine), C (Cytosine), G (Guanine), or T (Thymine). The genome of an organism is the listing of one strand of DNA as an enormously long sequence of symbols from the four-symbol alphabet {A, C, G, T}. Certain parts of the genome correspond to genes that get converted into proteins, while the rest are non-coding regions. In prokaryotes, or lower organisms, the genes are in one continuous stretch, whereas in eukaryotes, or higher organisms, the genes consist of a series of exons, interrupted by introns. The junctions between exons and introns are called splice sites, and the detection of splice sites is a very difficult problem.
For this reason, the focus in this paper is on finding genes in prokaryotes. It is easy to state some necessary but not sufficient conditions for a stretch of genome to be a gene, which are given as follows. The sequence must begin with the start codon ATG (in some organisms, GTG is also a start codon). The sequence must end with one of the three stop codons, namely, TAA, TAG, or TGA. The length of the sequence must be an exact multiple of three. A stretch of genome that satisfies these conditions is referred to as an open reading frame (ORF).

B. Statistical Approaches to Gene-Finding

There are in essence two distinct approaches to gene-finding, namely, string-matching and statistical modeling. In string-matching algorithms, one looks for symbol-for-symbol matching, whereas in statistical modeling one looks for similarity of statistical behavior. If one were to examine genes with the same function across two organisms, then it is likely that the two DNA sequences would match quite well at a symbol-for-symbol level. For instance, if one were to compare the gene that generates insulin in a mouse and in a human, the two strings would be very similar, except that occasionally one would have to introduce a gap in one sequence or the other. This particular problem, namely, to determine the best possible match between two sequences after inserting a few gaps here and there, is known as the optimal gapped alignment problem. If the two strings to be aligned have lengths m and n, respectively, then an optimal alignment can be computed by dynamic programming, whose complexity is O(mn). Parallel implementations of this alignment algorithm are also possible. For further details, see [7] and [5].
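The optimal gapped alignment just mentioned is a classical dynamic programming computation. The following is a minimal sketch for illustration only (it is not the method used in this paper), with arbitrary placeholder values for the match, mismatch, and gap scores:

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global (gapped) alignment score by dynamic programming in O(mn) time.
    D[i][j] holds the best score for aligning the prefixes a[:i] and b[:j]."""
    m, n = len(a), len(b)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        D[i][0] = i * gap                          # a[:i] aligned entirely against gaps
    for j in range(1, n + 1):
        D[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            D[i][j] = max(D[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                          D[i - 1][j] + gap,       # gap inserted in b
                          D[i][j - 1] + gap)       # gap inserted in a
    return D[m][n]

print(align_score("ATGGCGT", "ATGCGT"))            # two similar strings, one gap needed
```

Recovering the alignment itself, rather than just its score, requires an additional traceback over the same table.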

On the other hand, if one were to examine two different genes with distinct functionalities but from within the same organism or within the same family of organisms, then the two genes would not be similar at a symbol-for-symbol level. However, it is widely believed that they would match at a statistical level. The idea behind statistical prediction of genes can be summarized as follows. Suppose we are given several known strings of genes and several known strings of non-genes. We think of the known genes as sample paths of one stochastic process generated by a coding model and of the known non-genes as sample paths of another stochastic process generated by a non-coding model. Now suppose we are given an ORF, and we wish to classify it as being a gene or a non-gene. The logical approach is to use log-likelihood ratio classification. Thus, we compute the likelihood of the string according to the coding model and according to the non-coding model. If the likelihood under the coding model is larger, then we classify the ORF as a gene, whereas if the likelihood under the non-coding model is larger, then we classify it as a non-gene. In the unlikely event that both likelihoods are comparable, the method is inconclusive. The existing statistical gene prediction methods differ only in the manner in which the coding model and the non-coding model are constructed. In Genescan [18], the basic premise is that, in coding regions, the four nucleic acid symbols occur roughly with a period of three. Thus, there is no non-coding model as such. Instead, a discrete Fourier transform is taken of the occurrence of each nucleotide symbol, and the value is compared against a threshold. In Genscan [2], [3], the sample paths are viewed as outputs of a hidden Markov model that can also emit a blank symbol. Because of this, the length of the observed string does not match the length of the state sequence, thus leading to an extremely complicated dynamic programming problem. In GeneMark [12], the observed sequences are interpreted as the outputs of a multistep Markov process. Probably the most widely used classification method is Glimmer, which has several refinements to suit specific requirements. Some references to Glimmer can be found in [4] and [17]. We shall return to Glimmer and GeneMark again in Section V, when we present our computational results and compare them against those produced by Glimmer.

C. Contributions of This Paper

Glimmer uses an interpolated Markov model (IMM) whereby the sample paths are fit with multistep Markov models whose memory varies from 0 to 8. (Note that a Markov process with zero memory is an i.i.d. process.) This requires the estimation of a fairly large number of parameters. To overcome this difficulty, Glimmer uses high-order Markov models only when there is sufficient data to get a reliable estimate. Initially, GeneMark used a fifth-order Markov chain, but subsequent versions use refined versions, including a hidden Markov model (HMM). In contrast, the premise of our study is that, even in a multistep Markov process, different states have different effective memory. This leads to a mixed memory Markov model. Hence, the algorithm is called 4M. We begin by fitting the sample paths of both the coding regions and the non-coding regions with a fifth-order Markov model each. The reason for using a fifth-order model (thus exactly replicating hexamer frequencies) is that lower order models result in noticeably poorer performance, whereas higher order models do not seem to improve the performance very much. Using two fifth-order Markov models for the coding and non-coding regions results in two models having 1024 states each. Then, by using a simple singular value condition, many of these states are combined into one common state. In some cases, the resulting reduced-size Markov model has as few as 150 states, which is an 85% reduction!
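The log-likelihood ratio rule described above is straightforward to implement once the two models are available. The sketch below is illustrative only: it assumes each model is given as a dictionary mapping a (context, next symbol) pair to a transition probability, and the function names and the small probability floor are our own choices, not the authors' implementation.

```python
import math

def log_likelihood(seq, trans, k=5):
    """Log-likelihood of seq under a k-step Markov model supplied as a dict
    {(context, next_symbol): probability}; the first k symbols are ignored for brevity."""
    total = 0.0
    for i in range(k, len(seq)):
        total += math.log(trans.get((seq[i - k:i], seq[i]), 1e-12))  # floor avoids log(0)
    return total

def classify_orf(orf, coding_trans, noncoding_trans, k=5):
    """Gene if the coding model explains the ORF better, non-gene if the non-coding
    model does; comparable likelihoods are reported as inconclusive."""
    diff = log_likelihood(orf, coding_trans, k) - log_likelihood(orf, noncoding_trans, k)
    if abs(diff) < 1e-9:
        return "inconclusive"
    return "gene" if diff > 0 else "non-gene"
```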
In addition to the singular value test to choose the level of reduction permissible, we also use the Kullback-Leibler (K-L) divergence rate [9] to bound the error introduced by reducing the size of the state space. This upper bound on the K-L divergence rate can be used to choose a threshold parameter in the rank condition in an intelligent fashion. In addition, the K-L divergence rate is also used to demonstrate that the three-periodicity effect is very pronounced in coding regions but not in the non-coding regions. The statistical significance of the 4M algorithm is rather easy to analyze. As a result, when some ORFs are predicted to be genes using the 4M algorithm, our confidence in the prediction can be readily quantified, and the predictions can be ranked in order of decreasing confidence. In this way, the most confident predictions (if they are also interesting from a biological standpoint) can be followed up for experimental verification. All in all, the conclusion is that the 4M algorithm performs comparably well or somewhat better than Glimmer and GeneMark; however, in the case of the 4M algorithm, it is quite easy and straightforward to compute the statistical significance of the conclusions drawn.

II. 4M ALGORITHM

A. Multistep Markov Models

Recall that the statistical approach to gene-finding depends on being able to construct two distinct models, one for the coding regions and one for the non-coding regions. From purely a mathematical standpoint, we have only one problem at hand, namely, given a set of sample paths of a stationary stochastic process, construct a model for these paths. In other words, both models are constructed using exactly the same methodology, but applied to distinct sets of sample paths. Thus, let us concentrate on this problem formulation. Suppose n is a positive integer, and consider an alphabet of n symbols. (In the case of genomics, n = 4 and the alphabet is {A, C, G, T}.) Suppose we are given a stationary stochastic process assuming values in this alphabet, and we have at hand several sample paths of this process. The objective is to construct a stochastic model for the process on the basis of these observations. Suppose an integer k is specified, and we know the statistics of the process up to order k + 1. This means that the probabilities of occurrence of all (k + 1)-tuples are specified for the process. Let f(u) denote the probability of occurrence of a string u. Since the process is stationary, this probability does not depend on where in the sequence the string occurs. Note that the frequencies must satisfy a set of consistency conditions, as follows.
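The displayed consistency conditions were lost in the conversion of this document. In the f notation introduced above they are presumably the standard stationarity conditions, roughly of the following form (our reconstruction, not a verbatim quotation):

```latex
\sum_{a \in \{A,C,G,T\}} f(ua) \;=\; f(u) \;=\; \sum_{a \in \{A,C,G,T\}} f(au)
\qquad \text{for every string } u .
```

In words, the (k + 1)-tuple frequencies must marginalize, on either end, to the k-tuple frequencies.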

There is a well-known procedure that perfectly reproduces the specified statistics by modeling the given process as a k-step Markov process. Assuming that the process is a k-step Markov process means that the conditional probability of the next symbol, given the entire past, depends only on the k most recent symbols; the probability is not affected by symbols that lie further back. Moreover, the transition probability of this multistep Markov process is computed as the ratio of the corresponding (k + 1)-tuple frequency to the k-tuple frequency of the context (1). The above model, though it is often called a k-step Markov model, is also a traditional (one-step) Markov model over the larger state space consisting of all k-tuples. Suppose u and v are two such states. Then a transition from u to v is possible only if the last k - 1 symbols of u (read from left to right) are the same as the first k - 1 symbols of v. In this case, the probability of transition from the state u to the state v is given by (1); for all other v, the transition probability equals zero. It is clear that, though the state transition matrix has dimension n^k x n^k, every row contains at most n nonzero entries. Such a k-step Markov model perfectly reproduces the tuple frequencies of all lengths up to k + 1. Given a long string, we can write its probability as a product of such transition probabilities, and in this product all of the factors are of reasonable size. To round out the discussion, suppose that the statistics of the process are not known precisely, but need to be inferred on the basis of observing a set of sample paths. This is exactly the problem we have at hand. In this case, one can still apply (1), but with the actual (but unknown) probabilities replaced by their empirically observed frequencies. Each of these gives an unbiased estimate of the corresponding probability.

B. 4M Algorithm

Here, we introduce the basic idea behind the 4M algorithm and then present the algorithm itself. We begin with a multistep Markov model and then reduce the size of the state space further by using a criterion for determining whether some states are Markovian. The basis for the 4M algorithm is a simple property of Markov processes. Consider a Markov chain evolving over a finite alphabet, and consider the frequency of a triplet of symbols. Clearly, for any process (Markovian or not), the frequency of a triplet equals the frequency of its first two symbols times the conditional probability of the third symbol given the first two. Now, if the process is Markovian, then this conditional probability depends only on the middle symbol and not on the first. Hence, if we fix the middle symbol and examine the matrix of triplet frequencies whose rows are indexed by the first symbol and whose columns are indexed by the third symbol, it will have rank one, because every row is the same conditional distribution scaled by the frequency of the corresponding leading pair. There is nothing special about using only a single symbol in the middle: if the middle string is of any finite length, then, just as above, the matrix of frequencies indexed by the preceding symbol and the following symbol has rank one (2). Conversely, if the corresponding (semi-infinite) matrix has rank one for every middle string, then the process is Markovian. Now suppose we drop the assumption that the process is Markovian, and suppose the matrix has rank one for a particular fixed middle string. An elementary exercise in linear algebra shows that, in such a case, the conditional probability of the next symbol depends only on that middle string; this follows by reversing the above reasoning. Accordingly, let us define a state to be Markovian of a given order if the associated matrix has rank one. The distinction here is that we are now speaking about an individual state being Markovian as opposed to the entire process.
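The rank-one property just described can be checked numerically from empirical tuple frequencies. The sketch below builds, for a candidate middle string, the matrix of (k + 1)-tuple frequencies indexed by the preceding context and the following symbol, and inspects its singular values; the helper function, the variable names, and the toy data are illustrative assumptions of ours, not the paper's code.

```python
import itertools
from collections import Counter

import numpy as np

ALPHABET = "ACGT"

def kmer_freqs(samples, k):
    """Empirical k-tuple frequencies pooled over a set of training sequences."""
    counts = Counter()
    for s in samples:
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    total = sum(counts.values()) or 1
    return {w: c / total for w, c in counts.items()}

def state_matrix(freqs_k1, v, k):
    """Matrix F with F[w, a] = f(w v a), where w runs over the length-(k - len(v))
    prefixes and a over the next symbol; rank one means the next symbol depends
    on the suffix v alone."""
    prefixes = ["".join(p) for p in itertools.product(ALPHABET, repeat=k - len(v))]
    return np.array([[freqs_k1.get(w + v + a, 0.0) for a in ALPHABET] for w in prefixes])

freqs6 = kmer_freqs(["ATGGCGTTTAAACGTTAGATGCCC"], k=6)     # toy hexamer frequencies
print(np.linalg.svd(state_matrix(freqs6, "CGT", k=5), compute_uv=False))
```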

The above rank one property and definition can also be extended to multistep Markov processes. Suppose the process is a k-step Markov process, so that, for each fixed context of length k, the conditional distribution of the next symbol is determined (3). Hence, for each fixed context, the associated matrix of frequencies has rank one. Following earlier reasoning, we can define a state to be a Markovian state of order m if the corresponding matrix has rank one. Let us now consolidate all of this discussion and apply it to reducing the size of the state space of a multistep Markov process. Suppose, as before, that we are given a stationary stochastic process assuming values in a finite set. We have already seen that, in order to reproduce perfectly the (k + 1)-tuple frequencies, it suffices to construct a k-step Markov model. Suppose that such a multistep model has indeed been constructed; this model makes use of the (k + 1)-tuple frequencies. Now suppose that, for some integer m < k and some string u of length m, it is the case that the associated matrix has rank one. By elementary linear algebra, this implies that the conditional probability of finding a symbol at a particular location depends only on the preceding m symbols u and not on the symbols that precede u. Hence, if the matrix has rank one, then we can collapse all states of the form wu, for all prefixes w, into a single state. For this reason, we call u a Markovian state if its matrix has rank one. The interpretation of u being a Markovian state is that, when this string occurs, the process has a memory of only m time steps and not k in general. To implement the reduction in state space, we therefore proceed as follows.

Step 1) Compute the vector of (k + 1)-tuple frequencies.

Step 2) Set m = 1 and, for each string u of length m, compute the associated matrix. If it has rank one, then collapse all states of the form wu, for all prefixes w, into a single state. Repeat this test for all strings of length m.

Step 3) Increase the value of m by one and repeat until m = k - 1.

When the search process is complete, the initial set of n^k states will have been collapsed into some intermediate number of states, whose value depends on the (k + 1)-tuple frequencies. Since we are modeling the process as a k-step Markov process, in general, for a string of any length, we can write its probability as a product of conditional probabilities with memory k (4). Now, if a substring of the string is a Markovian state, then we can take advantage of (3) and substitute the corresponding shorter-memory conditional probability in the above formula. This is the reason for calling the algorithm a mixed memory Markov model, since different tuples have memories of different lengths. The preceding theory is exact provided we use true probabilities in the various computations. However, in setting up the multistep Markov model, we are using empirically observed frequencies and not true probabilities. Hence, it is extremely unlikely that any matrix will exactly have rank one. At this point, we take advantage of the fact that we wish to do classification and not modeling. This means that, in constructing the coding model and the non-coding model, it is really not necessary to get the likelihoods exactly right; it is sufficient for them to be of the right order of magnitude. Hence, in implementing the 4M algorithm, we take a matrix as having effective rank one if it satisfies the condition σ2 ≤ ε σ1 (5), where σ1 and σ2 denote the largest two singular values of the matrix and ε is an adjustable threshold parameter. We point out in Section VI that, by using the K-L divergence rate between Markov models, it is possible to choose the threshold ε intelligently.
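Steps 1)-3) together with the effective-rank-one condition (5) translate directly into a reduction loop. The sketch below reuses the hypothetical kmer_freqs() and state_matrix() helpers from the previous sketch; the threshold value and the returned mapping (from each full k-tuple state to the shortest Markovian suffix found for it) are illustrative choices of ours, not the paper's implementation.

```python
import itertools

import numpy as np

ALPHABET = "ACGT"

def effective_rank_one(F, eps):
    """Condition (5): the second-largest singular value is at most eps times the largest."""
    if not F.any():
        return False
    s = np.linalg.svd(F, compute_uv=False)
    return s[1] <= eps * s[0]

def four_m_reduce(freqs_k1, k=5, eps=0.1):
    """Step 1: (k+1)-tuple frequencies are given (freqs_k1).  Steps 2-3: test candidate
    suffixes of increasing length m and collapse every state ending in a Markovian
    suffix; keeping the first (shortest) suffix found mimics the mixed-memory idea."""
    memory = {}                                        # full k-tuple state -> collapsed suffix
    for m in range(1, k):                              # Step 3: increase m up to k - 1
        for v in ("".join(t) for t in itertools.product(ALPHABET, repeat=m)):
            if effective_rank_one(state_matrix(freqs_k1, v, k), eps):
                for w in ("".join(t) for t in itertools.product(ALPHABET, repeat=k - m)):
                    memory.setdefault(w + v, v)        # Step 2: collapse states of the form wv
    return memory
```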
Setting a state to be a Markovian state even if its matrix is not exactly a rank one matrix is equivalent to making an approximation: the full-memory conditional probabilities in question are replaced by the shorter-memory conditional probabilities (6). In other words, in the original k-step Markov model, the entries in the rows corresponding to all states of the form wu are modified according to (6).

III. K-L DIVERGENCE RATE

Here, we introduce the notion of the K-L divergence rate between stochastic processes and its applications to Markov chains. Then, we derive an expression for the K-L divergence rate between the original k-step Markov model and the 4M-reduced model. This formula is of interest because the two processes have a common output space, but not a common state space. These results are applied to some problems in genomics in Section IV.

A. K-L Divergence

Let n be an integer, and let the n-simplex denote the set of probability vectors with n components. Thus, the n-simplex is just the set of probability distributions for an n-valued random variable. Suppose we are given two such

5 30 SPECIAL ISSUE ON SYSTEMS BIOLOGY, JANUARY 2008 probability distributions. Then the K-L divergence between the two vectors is defined as Note that, in order for to be finite, needs to be dominated by, that is,. We write or to denote that is dominated by or that dominates. Here, we adopt the usual convention that. The K-L divergence has several possible interpretations, of which only one is given here. Suppose we are given data generated by an i.i.d. sequence whose one-dimensional marginal distribution is. There are two competing hypotheses, namely, that the probability distribution is and that the probability distribution is, neither of which may be the truth. If we observe a sequence, where is the length of the observation and each has one of the possible values, we compute the likelihood of the observation under each of the two hypotheses and choose the more likely one, that is, the hypothesis that is more compatible with the observed data. In this case, it is easy to show that the expected value of the log-likelihood ratio is precisely equal to. Thus, in the long run, we will choose the hypothesis if and the hypothesis if. In other words, in the long run, we will choose the hypothesis that is closer to the truth. Therefore, even though the K-L divergence is not truly a distance (it does not satisfy either the symmetry property or the triangle inequality), it does induce a partial ordering on the set. The difference is the per-symbol contribution to the log-likelihood ratio. As a parenthetical aside, see [11] for a very general discussion of divergence generated by an arbitrary convex function. For an appropriate choice of the convex function, the corresponding divergence will in fact satisfy the one-sided triangle inequality. However, the popular choice is not one such function. B. K-L Divergence Rate The traditional K-L divergence measure is perfectly fine when the classification problem involves a sequence of independent observations. However, in trying to model a stochastic process via observing it, it is not always natural to assume that the observations are independent. It is therefore desirable to have a generalization of the K-L divergence to the case where the samples may be dependent. Such a generalization is given by the K-L divergence rate. Suppose is some set, and is a stochastic process assuming values in the set. Thus, the stochastic process itself assumes values in the infinite Cartesian product space. Suppose are two probability laws, that is, probability measures on the product space. In principle, we could define the K-L divergence between the two laws by extending the standard definition, using Radon Nikodym derivatives and so on. The trouble is that most of the time the divergence would be infinite and conveys no useful information. Thus, blindly computing the divergence between the two laws of a stochastic process gives no useful information most of the time. (7) To get around this difficulty, it is better to use the K-L divergence rate. It appears that the K-L divergence rate was introduced in [9]. If and are two probability laws on and if is a finite set, we define where and are the marginal distributions on and, respectively, onto the -dimensional product, and is just the conventional K-L divergence (without the rate). The idea is that, in many cases, the pure K-L divergence approaches infinity as. However, dividing by moderates the rate of growth. 
Moreover, if the ratio has a finite limit as, then the K-L divergence rate gives a measure of the asymptotic rate at which the pure divergence blows up as. The K-L divergence rate has essentially the same interpretation as the K-L divergence. Suppose we are observing a stochastic process whose law is. We are trying to decide between two competing hypotheses: The process has the law, and the process has the law. After samples, the expected value of the log-likelihood ratio is asymptotically equal to. The paper [16] gives a good historical overview of the properties of the K-L divergence. Specifically, in general the K-L divergence may not exist between arbitrary probability measures, but it seems to exist under many reasonable conditions. For example, it is known [6] that, if is a stationary law and is the law of a finite-state Markov process, then the K-L divergence rate is well defined. It is shown in [14] that the K-L divergence rate exists if both laws correspond to ergodic processes. C. K-L Divergence Rate Between Markov Processes In [16], an explicit formula is given for the K-L divergence rate between two Markov processes over a common (finite) state space. We give an alternate version of the formula derived in [16], which generalizes very cleanly to multistep Markov processes. Suppose and are the laws of two Markov processes over a finite set. Thus, are stochastic matrices and are corresponding stationary vectors. Thus,.If is the law of the Markov process and is the law of the Markov process, then it is shown in [16] that where denote the th rows of the matrics and, respectively. In order for the divergence rate to be finite, the two state transition matrices and must satisfy the condition or, in the earlier notation, we must have. We denote this condition by or. (8) (9)

Now we give an alternate formulation of (9) that is in some sense a little more intuitive.

Theorem 1: Suppose we are given two stochastic matrices together with associated stationary probability distributions, and consider the laws of the corresponding Markov processes. Let the doublet frequency vector of a chain denote the vector of frequencies of pairs of consecutive states, and let the singlet frequency vector denote its vector of state frequencies. Suppose the first law is dominated by the second. Then, the K-L divergence rate between the two Markov chains is given by the divergence between the doublet frequency vectors minus the divergence between the singlet frequency vectors (10), where the divergence is the conventional K-L divergence between probability vectors.

1) Remarks: Formula (10) gives a nice interpretation of the K-L divergence rate between Markov chains: it is just the difference between the divergence of the doublet frequencies and the divergence of the singlet frequencies. Moreover, it is easy to extend it to k-step Markov chains: the K-L divergence rate is just the difference between the divergence of the (k + 1)-tuple frequencies and the divergence of the k-tuple frequencies. The proof is omitted in the interests of brevity; it can be found in [21]. In [16], the authors do not give an explicit formula for the K-L divergence rate between multistep Markov chains. There is an analogous formula to (10) in the case of k-step Markov models. Define the frequency vectors of (k + 1)-tuples and of k-tuples for each of the two Markov chains in the obvious fashion. Then the divergence rate equals the divergence between the (k + 1)-tuple frequency vectors minus the divergence between the k-tuple frequency vectors (11). The proof, based on the fact that a k-step Markov model is just a one-step Markov model on the state space of k-tuples, is easy and is left to the reader.

D. K-L Divergence Rate When the 4M Algorithm is Used

Theorem 1 gives the K-L divergence rate between two Markov processes over the same state space. In this paper, we begin with a k-step Markov process and approximate it by some other Markov process by applying the 4M algorithm. When we do so, the resulting processes no longer share a common state space. Thus, Theorem 1 no longer applies. The next theorem gives a formula for the K-L divergence rate when the 4M algorithm is used to achieve this reduction. Note that the problem of computing the K-L divergence rate between two entirely arbitrary HMMs with a common output space is still an open problem.

Theorem 2: Suppose we are given a stationary stochastic process whose (k + 1)-tuple frequencies are specified, and let its approximation by a k-step Markov process be constructed as above. Suppose now that we apply the 4M algorithm and choose various tuples, of various lengths, as Markovian states, resulting in a reduced stochastic process. Then, the K-L divergence rate between the original k-step Markov model and the 4M-reduced Markov model (or equivalently, between the laws of the two processes) is given by (12).

Proof: Note that the full k-step Markov model has exactly n nonzero entries in each row, namely, the conditional probabilities of the n possible next symbols given the k-tuple labeling that row. One can think of the reduced-order model obtained by the 4M algorithm as containing the same rows, except that, if u is a Markovian state, then the entries in all rows corresponding to states that end in u are changed from the full-memory conditional probabilities to the conditional probabilities given u alone. The vector of k-tuple frequencies is a stationary distribution of the original k-step Markov model. Now (12) readily follows from (11). Note that, in applying the 4M algorithm, we approximate the full-memory conditional probability by the shorter-memory conditional probability for each string that is deemed to be a Markovian state. Hence, the quantity inside the logarithm in (12) should be quite close to one, and its logarithm should be close to zero.
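Formula (11) makes the divergence rate between two multistep Markov models computable from tuple frequencies alone. The short sketch below is our own illustration under an assumed data layout (dictionaries mapping tuples to frequencies), with base-2 logarithms as stated in Section IV; it is not the authors' code.

```python
import math

def kl_divergence(p, q):
    """Conventional K-L divergence D(p || q), in bits, between two frequency
    dictionaries; assumes q dominates p (q[w] > 0 whenever p[w] > 0)."""
    return sum(pw * math.log2(pw / q[w]) for w, pw in p.items() if pw > 0.0)

def kl_rate(p_k1, q_k1, p_k, q_k):
    """Formula (11): the divergence rate between two k-step Markov models is the
    divergence of their (k+1)-tuple frequencies minus that of their k-tuple frequencies."""
    return kl_divergence(p_k1, q_k1) - kl_divergence(p_k, q_k)
```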
IV. COMPUTATIONAL RESULTS I: APPLICATIONS OF THE K-L DIVERGENCE RATE

This section contains the first set of computational results. Here, we study the three-periodicity of coding regions using the K-L divergence rate. The same K-L divergence rate is also used to show that there is virtually no three-periodicity effect in the non-coding regions. Then, we analyze the effect of reducing the size of the state space using the 4M algorithm in terms of the generalization error.

A. List of Organisms Analyzed

The 4M algorithm was applied to 75 prokaryote genomes of microbial organisms. These genomes comprised both bacteria as well as archaea. To save space, in the tables showing the computational results we give the names of the various organisms in a highly abbreviated form. Table V gives a list of all the organisms for which the computational results are presented here, together with the abbreviations used.

B. Three-Periodicity of Coding Regions

There is an important feature that needs to be built into any stochastic model of genome sequences. It has been observed that, if one were to treat the coding regions as sample paths of a stationary Markov process, then the results are pretty poor. The reason is that genomic sequences exhibit a pronounced three-periodicity. This means that the conditional probability of the next symbol given the preceding symbols is not independent of the position within the gene, but is instead periodic with a period of three. Thus, instead of constructing one k-step Markov model, we must in fact construct three such models. These are referred to as the Frame 0, Frame 1, and Frame 2 models. We begin from the start of the genome and label the first nucleotide as Frame 0, the second nucleotide as Frame 1, the third

nucleotide as Frame 2, and loop back to label the fourth nucleotide as Frame 0, and so on. Then the 4M reduction using the rank condition is applied to each frame. Since a three-periodic Markov chain over a given state space can also be written as a stationary Markov chain over a suitably enlarged state space, there are no conceptual difficulties because of three-periodicity.

TABLE I: DIVERGENCES BETWEEN MARKOV MODELS OF CODING REGIONS

TABLE II: DIVERGENCES BETWEEN MARKOV MODELS OF NON-CODING REGIONS

In this subsection, we use the K-L divergence rate introduced in Section III to assess the significance of three-periodicity in both coding and non-coding regions in various organisms. The study is carried out as follows. In 13 organisms, we constructed three-periodic models for the known coding regions as well as known non-coding regions, using the value k = 5. This means that for each organism we constructed six different fifth-order Markov models that perfectly reproduced the observed hexamer frequencies (three coding-region models and three non-coding-region models, one per frame). For the three coding-region models, we computed six different divergence rates, namely, the rates between all ordered pairs of distinct frames; then, we did the same for the three non-coding-region models. (Remember that the K-L divergence rate is not symmetric.) Tables I and II show these six divergence rates for 13 of the organisms listed in Section IV-A, for both coding regions as well as non-coding regions. Actually, we computed such divergences for all 75 organisms, but only 13 are presented here. To make the tables fit within the two-column format, we use obvious abbreviations for the model names. Note that throughout the base of the logarithm used in (12) is 2; thus, all logarithms are binary logarithms. From Tables I and II, it is clear that the three-periodicity effect in the non-coding regions is noticeably less than in the coding regions, in the sense that the K-L divergence rates between the three frames of the non-coding models are essentially negligible, compared with the corresponding divergence rates in the coding regions. In fact, except for Mycoplasma genitalium, the divergences in the non-coding regions are essentially negligible. M. genitalium is a peculiar organism, in which the codon TGA codes for the amino acid tryptophan, instead of being a stop codon as it is in practically all other organisms. Thus, when we construct algorithms for predicting genes, we would be justified in ignoring the three-periodicity effect in the non-coding regions.

TABLE III: SIZES OF 4M-REDUCED MARKOV MODELS

C. Reduction in Size of State Space

Here, we study the reduction in the size of the state space when the 4M algorithm is used. In applying the 4M algorithm, we used the value k = 5. It has been verified numerically that smaller values of k do not give good predictions, while larger values of k do not lead to any improvement in performance. Thus, k = 5 seems to be the right value. Hence, for each organism, we constructed three coding-region models and one non-coding-region model, each model being fifth-order Markovian. Recall that, due to the three-periodicity of the coding regions, we need three models, one for each frame in the coding region. Since the non-coding region does not show so much of a three-periodicity effect (as demonstrated in the preceding subsection), we ignore that possibility and construct just one model for the non-coding region.
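A small sketch of the frame bookkeeping described above: each position of a coding sequence is assigned to Frame 0, 1, or 2 by its index modulo three, and the training tuples are pooled per frame so that three separate fifth-order models can be estimated. The convention used here (the frame of a tuple is that of its final symbol) is our assumption for illustration.

```python
def tuples_by_frame(coding_seq, k=5):
    """Split the (k+1)-tuples of a coding sequence into three bins according to the
    frame (index modulo 3) of each tuple's final symbol; a separate k-step model is
    then trained from each bin."""
    frames = {0: [], 1: [], 2: []}
    for i in range(k, len(coding_seq)):
        frames[i % 3].append(coding_seq[i - k:i + 1])
    return frames
```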
Each of the fifth-order Markovian models has 1024 states, consisting of pentamers of nucleotides. Then, for each of these four models, we applied the 4M reduction, with the threshold ε in (5) set somewhat arbitrarily; a more systematic way to choose ε is given in Section VI. Recall that the larger the threshold, the larger the number of states that will satisfy the near rank one condition (5), and the greater the reduction in the size of the state space. The CPU time for computing the hexamer frequencies of a genome with about one million base pairs is approximately 10 s on an Intel Pentium IV processor running at 2.8 GHz, while the state space reduction takes just 0.3 s, or about 3% of the time

needed to compute the frequencies. Thus, once the hexamer frequencies are constructed, the extra effort needed to apply the 4M algorithm is negligible. Table III shows the size of the 4M-reduced state space for each of the 13 organisms studied. All of the numbers in the table should be compared with 1024, which is the number of states of the full fifth-order Markov chain. Moreover, in Glimmer and its variants, one uses up to eighth-order Markov models for certain organisms, meaning that in the worst case the size of the state space could be as high as 4^8 = 65,536. From this table, it is clear that in most cases the 4M algorithm leads to a fairly significant reduction in the size of the state spaces. There are some dramatic reductions, such as in the case of B. sub, for which the reduction in the size of the state space is of the order of 85%. Moreover, in almost all cases, the size of the state space is reduced by at least 50%.

TABLE IV: RESULTS OF WHOLE GENOME ANNOTATION USING THE 4M ALGORITHM USING 50% TRAINING DATA

TABLE V: ABBREVIATIONS OF ORGANISM NAMES AND COMPARISON OF 4M VERSUS GLIMMER3

V. COMPUTATIONAL RESULTS II: GENE PREDICTION

Now we come to the main topic of this paper, namely, finding genes. One can identify two distinct philosophies in gene-prediction algorithms. Some algorithms, including the one presented here, can be described as bootstrapping. Thus, we begin with some known genes, construct a stochastic model based on those, and then use that model to classify the remaining ORFs as potential genes. The most promising predictions are then validated either through experiment or through comparison with known genes of other similar organisms. The validated genes are added to the training sample and the process is repeated. This is why the process may be called bootstrapping. In contrast, Glimmer (in its several variants), which is among the most popular and most accurate prediction algorithms at present, can be described as an ab initio scheme. In Glimmer, all ORFs longer than 500 base pairs are used as the training set for the coding regions. The premise is that almost all of these are likely to be genes anyway. In principle, we could apply the 4M algorithm with the same initial training set and the results would not be too different from those presented here. For a genome with one million base pairs, the 4M algorithm required approximately 10 s of CPU time and approximately 5 Mb of storage for training the coding and non-coding models, compared with 10 s of CPU time and 50 Mb of storage for Glimmer3. The prediction problem took about 60 s of CPU time and 20 Mb of storage for 4M versus 13 s of CPU time and 4 Mb of storage for Glimmer3. Our implementation of the 4M algorithm was done using Python, which is very efficient for the

programmer but very inefficient in terms of CPU time. Thus, we believe that there is considerable scope for reduction in both the CPU time as well as the storage requirements when implementing the 4M algorithm.

TABLE VI: COMPARISON OF 4M ALGORITHM VERSUS GENEMARK2.5D AND GENEMARKHMM2.6g

A. Classification of Annotated Genes Using 4M and Other Methods: Comparison of Results

Here, we take the database of annotated genes for each of 75 organisms and classify them using 4M, Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. The database of annotated genes represents a kind of consensus. Some of the genes in the database are experimentally validated, while others are sufficiently similar at a symbol-for-symbol level to other known genes that these too are believed to be genes. Thus, it is essential that any algorithm should pick up most if not all of these annotated genes. The test was conducted as follows. To construct the coding and non-coding models for the 4M algorithm, we took some known genes to train the coding model and some known non-coding regions to train the non-coding model. The fraction of known genes used to train the coding model was 50%, that is, we used every other gene. For the non-coding model, we picked around 20% of the known non-coding regions at random. Throughout, we used a three-periodic model for the coding regions and a uniform (i.e., nonperiodic) model for the non-coding regions. These models were then 4M-reduced using the threshold condition (5). Then the remaining (known) coding and non-coding regions were classified using the log-likelihood method. In the tables, we have used the following notation: Total Genes denotes the total number of genes in the annotated database; one column gives the genes picked up by both the 4M algorithm and Glimmer3; another gives the genes missed by both algorithms; another gives the genes picked up by the 4M algorithm but missed by Glimmer3; and another gives the genes missed by the 4M algorithm but picked up by Glimmer3. Similar notation is used in Table VI, with Glimmer3 replaced by GeneMark2.5d (denoted by GMK) and GeneMarkHMM (denoted by GHMM). First, we compare the performance of the 4M algorithm against that of Glimmer3, as detailed in Table V. It is worth pointing out that, in the results presented here, there is no postprocessing of the raw output of the 4M algorithm, as is common with other algorithms. The key points of comparison are the numbers in the next-to-last and last columns. From this table, it can be seen that, except in the case of seven organisms (B. jap, C. vio, G. vio, the three organisms of the Pseudomonas family, and R. etl), 4M finds at least as many genes as it misses, compared with Glimmer3. On the other side, in 32 organisms, the number of annotated genes found by 4M and missed by Glimmer3 is more than double the number missed by 4M and found by Glimmer3. Next, a glance at Table VI reveals that 4M overwhelmingly outperforms GeneMark2.5d, which is an older algorithm based on modeling the genes using a fifth-order Markov model. Since 4M is also based on a fifth-order Markov model, but with some reduction in the size of the state space, the vastly superior performance of 4M is intriguing to say the least. Compared with GeneMarkHMM2.6g, the superiority of 4M is not so pronounced; nevertheless, 4M has the better performance. To summarize, 4M somewhat outperforms Glimmer3 and GeneMarkHMM2.6g in most cases, and considerably outperforms GeneMark2.5d. Finally, we compared the performance of the 4M algorithm with Glimmer3 on short genes. It is widely accepted that long genes are easy to find using just about any algorithm and that the real test of an algorithm is its ability to find short genes.

TABLE VII: COMPARISON OF 4M ALGORITHM VERSUS GLIMMER3 ON SHORT GENES
Since 4M significantly outperforms both versions of GeneMark in any case, we present only the comparison of 4M against Glimmer3 on three sets of genes: those of length less than 150 base pairs, between 151 and 300 base pairs, and between 301 and 500 base pairs. In presenting the results, we omitted any organism where the number of ultrashort genes of length less than 150 base pairs was less than 20. These results are found in Table VII. From this table, it is clear that 4M vastly outperforms Glimmer3 in predicting ultrashort genes and is somewhat superior in finding short genes. In the case of the organism M. genitalium, which has the exceptional property that the codon TGA codes for the amino acid tryptophan instead of being a stop codon, the 4M algorithm performs poorly when this fact is not incorporated. However, when this fact is incorporated, the performance of 4M improves dramatically. This is why there are two rows corresponding to M. genitalium: the first row assumes that TGA is a stop codon, and the second assumes that TGA is not a stop codon. We were at first rather startled by the extremely poor performance

of the 4M algorithm in the case of M. genitalium, considering that the algorithm performed so well on the rest of the organisms. This caused us to investigate the organism further and led us to discover from the literature that, in fact, M. genitalium has a nonstandard genetic code. The moral of the story is that, by purely statistical analysis, we could find out that there was something unusual about this organism. More interestingly, even in the case of the extremely well-studied organism E. coli, neither the 4M algorithm nor Glimmer3 performs particularly well. This kind of poor performance is usually indicative of some nonstandard behavior on the part of the organism, as in the case of M. genitalium. This issue needs to be studied further.

B. Whole Genome Annotation Using the 4M Algorithm

Here, we carry out whole genome annotation of 13 organisms using the 4M algorithm. First, we identify all of the ORFs in the entire genome. Then, we train the coding model using every other gene in the database of annotated genes and the non-coding model using about 20% of the known non-coding regions. Both models are 4M-reduced. Then, the entire set of ORFs is classified using the log-likelihood classification scheme. The legend for the column headings in Table IV is as follows: the organism, the number of ORFs, the number of annotated genes, the number of annotated genes that are picked up by 4M and predicted to be genes, the number of other ORFs whose current status is unknown, and, finally, the number of these other ORFs that are predicted by 4M to be genes. To save space, we present results for only 13 organisms. Now let us discuss the results in Table IV. It is reasonable to expect that, since half of the annotated genes are used as the training set, the other half of the annotated genes are picked up by 4M as being genes. However, what is surprising is how few of the other ORFs whose status is unknown are predicted to be genes. Actually, for each organism, the 4M algorithm predicts several hundred ORFs to be additional genes. However, since there are many overlaps amongst these ORFs, we eliminate the overlaps to predict a single gene for each set of overlapping regions. Thus, there is a clear differentiation between the statistical properties of the annotated genes and the ORFs whose status is unknown, and the 4M algorithm is able to differentiate between them. These additional predicted genes are good candidates for experimental verification. In order to prioritize them, they can be ranked in terms of the normalized log-likelihood ratio, that is, the raw log-likelihood ratio divided by the length of the ORF. The reason for normalizing the log-likelihood ratio is that, as the length becomes large, the raw log-likelihood ratio will also become large. Thus, comparing the raw log-likelihood ratios of two ORFs is not meaningful. However, comparing the normalized log-likelihood ratios allows us to identify the predictions about which we are most confident. This is the advantage of having a stochastic modeling methodology whose significance is easy to analyze.
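The length-normalized ranking just described is simple to implement. In the sketch below, the two scoring callables stand in for the trained coding and non-coding models; the names are ours and the snippet is illustrative rather than the paper's implementation.

```python
def rank_predicted_genes(orfs, loglik_coding, loglik_noncoding):
    """Sort candidate ORFs by the normalized log-likelihood ratio, i.e. the raw
    log-likelihood ratio divided by the ORF length, most confident predictions first."""
    scored = [((loglik_coding(o) - loglik_noncoding(o)) / len(o), o) for o in orfs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```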
VI. CONCLUSION AND FUTURE WORK

We have studied the problem of finding genes from prokaryotic genomes using stochastic modeling techniques. The gene-finding problem is formulated as one of classifying a given sequence of bases using two distinct Markovian models for the coding regions and the non-coding regions, respectively. For the coding regions, we construct a three-periodic fifth-order Markovian model, whereas for the non-coding regions we construct a fifth-order Markovian model (ignoring the three-periodicity effect). Then, we introduced a new method known as 4M that allows us to assign variable-length memories to each symbol, thus permitting a substantial reduction in the size of the state space of the Markovian model. The disparities between various models have been quantified using the K-L divergence rate between Markov processes. The K-L divergence rate has a number of useful applications, some of which are brought out in this paper. For instance, using this measure, it has been conclusively demonstrated that the three-periodicity effect is much more pronounced in coding regions than in non-coding regions. This is why we could ignore three-periodicity in non-coding regions. An explicit formula has been given for the K-L divergence rate between a fifth-order Markov model and the 4M-reduced model. This formula allows us to quantify the classification error resulting from this model order reduction. Using this new algorithm, we annotated 75 different microbial genomes from several classes of bacteria and archaea. The performance of the 4M algorithm was then compared with those of Glimmer3, GeneMark2.5d, and GeneMarkHMM2.6g. It has been shown that the 4M algorithm somewhat outperforms Glimmer3 and considerably outperforms both versions of GeneMark. When it comes to finding ultrashort and short genes, the 4M algorithm significantly outperforms even Glimmer3. We also carried out whole genome annotations of all ORFs in several organisms using the 4M algorithm. We found that, while the 4M algorithm detects an overwhelming majority of annotated genes as genes, it picks up a surprisingly small fraction of the remaining ORFs as genes. Thus, the 4M algorithm is able to differentiate very clearly between the known genes and the unknown ORFs. Moreover, since the 4M algorithm uses a simple log-likelihood test, it is possible to rank all the predicted genes in terms of decreasing log-likelihood ratio. In this way, the most confident predictions can be tried out first. Formula (12) can be used to choose the threshold ε in (5) in an adaptive manner. Consider the laws of the full fifth-order Markov models for the coding regions and the non-coding regions, respectively, and the law of the 4M-reduced model obtained using a threshold ε. We should choose ε to be as large as possible while maintaining the constraint that the resulting K-L divergence rate stays below a new adjustable parameter. If we choose a very small value of this parameter, then the log-likelihood ratio between the coding and non-coding models will hardly be affected, if

12 VIDYASAGAR et al.: THE 4M (MIXED MEMORY MARKOV MODEL) ALGORITHM 37 is replaced by. This will be a more intelligent and adaptive way to choose. ACKNOWLEDGMENT The authors would like to thank B. Nittala and M. Haque for assisting with the interpretation of some of the computational results. REFERENCES [1] P. Baldi and S. Brünak, Bioinformatics: A Machine Learning Approach. Cambridge, MA: MIT Press, [2] C. Burge and S. Karlin, Prediction of complete gene structures in human genomic DNA, J. Molec. Biol., vol. 268, pp , [3] C. Burge and S. Karlin, Finding genes in genomic DNA, Curr. Opin. Struct. Biol., vol. 8, pp , [4] A. L. Delcher, D. Harmon, S. Kasif, O. White, and S. L. Salzberg, Improved microbial gene identification with GLIMMER, Nucleic Acids Res., vol. 27, no. 23, pp , [5] W. J. Ewens and G. R. Grant, Statistical Methods in Bioinformatics, 2nd ed. New York: Springer-Verlag, [6] R. M. Gray, Entropy and Information Theory. New York: Springer- Verlag, [7] D. Gusfield, Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge, U.K.: Cambridge Univ. Press, [8] F. Jelinek, Statistical Methods for Speech Recognition. Cambridge, MA: MIT Press, [9] B.-H. Juang and L. R. Rabiner, A probabilistic distance measure for hidden Markov models, AT&T Tech. J., vol. 64, no. 2, pp , Feb [10] A. Krogh, I. S. Mian, and D. Haussler, A hidden Markov model that finds genes in E. coli DNA, Nucleic Acids Res., vol. 22, no. 22, pp , [11] F. Liese and L. Vajda, On divergences and informations in statistics and information theory, IEEE Trans. Inf. Theory, vol. 52, no. 10, pp , Oct [12] A. V. Lukashin and M. Borodovsky, GeneMark.hmm: New solutions for gene finding, Nucleic Acids Res., vol. 26, no. 4, pp. 11 7, [13] W. H. Majoros and S. L. Salzberg, An empirical analysis of training protocols for probabilistic gene finders, BMC Bioinformat., vol.???, p. 206, [14] K. Marton and P. C. Shields, The positive-divergence and blowing up properties, Israel J. Math, vol. 86, pp , [15] L. W. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proc. IEEE, vol. 77, no. 2, pp , Feb [16] Z. Rached, F. Alalaji, and L. L. Campbell, The Kullback-Leibler divergence rate between Markov sources, IEEE Trans. Inf. Theory, vol. 50, no. 5, pp , May [17] S. L. Salzberg, A. L. Delcher, S. Kasif, and O. White, Microbial gene identification using interpolated Markov models, Nucleic Acids Res., vol. 26, no. 2, pp , [18] S. Tiwari, S. Ramachandran, A. Bhattacharya, S. bhattacharya, and R. Ramaswamy, Prediction of probable genes by Fourier analysis of gene sequences, Computat. Appl. Biosci., vol. 13, no. 3, pp , Jun [19] J. C. Venter et al., The sequence of the human genome, Science, vol. 291, pp , [20] M. Vidyasagar, A realization theory for hidden Markov models: The partial realization problem, in Proc. Symp. Math. Theory Netw. Syst., Kyoto, Japan, Jul. 2006, pp [21] M. Vidyasagar, Bounds on the Kullback-Leibler divergence rate between hidden Markov models, in Proc. IEEE Conf. Decision Control, Dec , Mathukumalli Vidyasagar (F 83) was born in Guntur, India, on September 29, He received the B.S., M.S., and Ph.D. degrees from the University of Wisconsin, Madison, in 1965, 1967, and 1969, respectively, all in electrical engineering. Between 1969 and 1989, he was a Professor of Electrical Engineering with various universities in the United States and Canada. 
His last overseas job was with the University of Waterloo, Waterloo, ON, Canada from 1980 to In 1989, he returned to India as the Director of the newly-created Centre for Artificial Intelligence and Robotics (CAIR) and built up CAIR into a leading research laboratory of about 40 scientists working on aircraft control, robotics, neural networks, and image processing. In 2000, he joined Tata Consultancy Services (TCS), Hyderabad, India, India s largest IT firm, as an Executive Vice President in charge of Advanced Technology. In this capacity, he created the Advanced Technology Centre (ATC), which currently consists of about 80 engineers and scientists working on e-security, advanced encryption methods, bioinformatics, Open Source/Linux, and smart-card technologies. He is the author or coauthor of nine books and more than 130 papers in archival journals. Dr. Vidyasagar is a Fellow of the Indian Academy of Sciences, the Indian National Science Academy, the Indian National Academy of Engineering, and the Third World Academy of Sciences. He was the recipient of several honors in recognition of his research activities, including the Distinguished Service Citation from the University of Wisconsin at Madison, the 2000 IEEE Hendrik W. Bode Lecture Prize, and the 2008 IEEE Control Systems Award. Sharmila S. Mande received the Ph.D. degree in physics from the Indian Institute of Science, Bangalore, India, in Her research interests include genome informatics, protein crystallography, protein modeling, protein protein interaction, and comparative genomics. She performed research work with the University of Groningen, The Netherlands, University of Washington, Seattle, the Institute of Microbial Technology, Chandigarh, India, and the Post Graduate Institute of Medical Education and Research, Chandigarh, before joining Tata Consultancy Services, Hyderabad, India, in 2001 as head of the Bio-Sciences Division, which is part of TCS Innovation Lab. Ch. V. Siva Kumar Reddy received the M.Tech. degree in computer science and engineering from the University of Hyderabad, Hyderabad, India. He is currently with Tata Consultancy Services (TCS) s Bio-Sciences Division, which is part of the TCS Innovation Lab, Hyderabad. His research interests include computational methods in gene prediction. V. Raja Rao received the M.S. degree in computer science and engineering from the Indian Institute of Technology, Mumbai, India, in He then joined the Bioinformatics Division, Tata Consultancy Services (TCS), Hyderabad, India, and was involved in the development of various bioinformatics products. His research interests include computational methods in gene prediction and parallel computing for bioinformatics. He is currently consulting for TCS at Sequenom Inc., San Diego, CA.


More information

CLASSICAL error control codes have been designed

CLASSICAL error control codes have been designed IEEE TRANSACTIONS ON INFORMATION THEORY, VOL 56, NO 3, MARCH 2010 979 Optimal, Systematic, q-ary Codes Correcting All Asymmetric and Symmetric Errors of Limited Magnitude Noha Elarief and Bella Bose, Fellow,

More information

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor Biological Networks:,, and via Relative Description Length By: Tamir Tuller & Benny Chor Presented by: Noga Grebla Content of the presentation Presenting the goals of the research Reviewing basic terms

More information

STRUCTURAL BIOINFORMATICS I. Fall 2015

STRUCTURAL BIOINFORMATICS I. Fall 2015 STRUCTURAL BIOINFORMATICS I Fall 2015 Info Course Number - Classification: Biology 5411 Class Schedule: Monday 5:30-7:50 PM, SERC Room 456 (4 th floor) Instructors: Vincenzo Carnevale - SERC, Room 704C;

More information

The Integers. Peter J. Kahn

The Integers. Peter J. Kahn Math 3040: Spring 2009 The Integers Peter J. Kahn Contents 1. The Basic Construction 1 2. Adding integers 6 3. Ordering integers 16 4. Multiplying integers 18 Before we begin the mathematics of this section,

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces

A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 9, NO. 4, MAY 2001 411 A Modified Baum Welch Algorithm for Hidden Markov Models with Multiple Observation Spaces Paul M. Baggenstoss, Member, IEEE

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH

HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH HYPERGRAPH BASED SEMI-SUPERVISED LEARNING ALGORITHMS APPLIED TO SPEECH RECOGNITION PROBLEM: A NOVEL APPROACH Hoang Trang 1, Tran Hoang Loc 1 1 Ho Chi Minh City University of Technology-VNU HCM, Ho Chi

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition

DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition DISCRETE STOCHASTIC PROCESSES Draft of 2nd Edition R. G. Gallager January 31, 2011 i ii Preface These notes are a draft of a major rewrite of a text [9] of the same name. The notes and the text are outgrowths

More information

Markov Models & DNA Sequence Evolution

Markov Models & DNA Sequence Evolution 7.91 / 7.36 / BE.490 Lecture #5 Mar. 9, 2004 Markov Models & DNA Sequence Evolution Chris Burge Review of Markov & HMM Models for DNA Markov Models for splice sites Hidden Markov Models - looking under

More information

objective functions...

objective functions... objective functions... COFFEE (Notredame et al. 1998) measures column by column similarity between pairwise and multiple sequence alignments assumes that the pairwise alignments are optimal assumes a set

More information

THE HEAVY-TRAFFIC BOTTLENECK PHENOMENON IN OPEN QUEUEING NETWORKS. S. Suresh and W. Whitt AT&T Bell Laboratories Murray Hill, New Jersey 07974

THE HEAVY-TRAFFIC BOTTLENECK PHENOMENON IN OPEN QUEUEING NETWORKS. S. Suresh and W. Whitt AT&T Bell Laboratories Murray Hill, New Jersey 07974 THE HEAVY-TRAFFIC BOTTLENECK PHENOMENON IN OPEN QUEUEING NETWORKS by S. Suresh and W. Whitt AT&T Bell Laboratories Murray Hill, New Jersey 07974 ABSTRACT This note describes a simulation experiment involving

More information

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University

Phylogenetics: Bayesian Phylogenetic Analysis. COMP Spring 2015 Luay Nakhleh, Rice University Phylogenetics: Bayesian Phylogenetic Analysis COMP 571 - Spring 2015 Luay Nakhleh, Rice University Bayes Rule P(X = x Y = y) = P(X = x, Y = y) P(Y = y) = P(X = x)p(y = y X = x) P x P(X = x 0 )P(Y = y X

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands

1/22/13. Example: CpG Island. Question 2: Finding CpG Islands I529: Machine Learning in Bioinformatics (Spring 203 Hidden Markov Models Yuzhen Ye School of Informatics and Computing Indiana Univerty, Bloomington Spring 203 Outline Review of Markov chain & CpG island

More information

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore

Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Pattern Recognition Prof. P. S. Sastry Department of Electronics and Communication Engineering Indian Institute of Science, Bangalore Lecture - 27 Multilayer Feedforward Neural networks with Sigmoidal

More information

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1.

Motifs and Logos. Six Introduction to Bioinformatics. Importance and Abundance of Motifs. Getting the CDS. From DNA to Protein 6.1. Motifs and Logos Six Discovering Genomics, Proteomics, and Bioinformatics by A. Malcolm Campbell and Laurie J. Heyer Chapter 2 Genome Sequence Acquisition and Analysis Sami Khuri Department of Computer

More information

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Interpolated Markov Models for Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Interpolated Markov Models for Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following the

More information

Elementary Linear Algebra, Second Edition, by Spence, Insel, and Friedberg. ISBN Pearson Education, Inc., Upper Saddle River, NJ.

Elementary Linear Algebra, Second Edition, by Spence, Insel, and Friedberg. ISBN Pearson Education, Inc., Upper Saddle River, NJ. 2008 Pearson Education, Inc., Upper Saddle River, NJ. All rights reserved. APPENDIX: Mathematical Proof There are many mathematical statements whose truth is not obvious. For example, the French mathematician

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier *

Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Protein Structure Prediction Using Multiple Artificial Neural Network Classifier * Hemashree Bordoloi and Kandarpa Kumar Sarma Abstract. Protein secondary structure prediction is the method of extracting

More information

Discriminative Direction for Kernel Classifiers

Discriminative Direction for Kernel Classifiers Discriminative Direction for Kernel Classifiers Polina Golland Artificial Intelligence Lab Massachusetts Institute of Technology Cambridge, MA 02139 polina@ai.mit.edu Abstract In many scientific and engineering

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

Radiological Control Technician Training Fundamental Academic Training Study Guide Phase I

Radiological Control Technician Training Fundamental Academic Training Study Guide Phase I Module 1.01 Basic Mathematics and Algebra Part 4 of 9 Radiological Control Technician Training Fundamental Academic Training Phase I Coordinated and Conducted for the Office of Health, Safety and Security

More information

Introduction. Chapter 1

Introduction. Chapter 1 Chapter 1 Introduction In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

On Objectivity and Models for Measuring. G. Rasch. Lecture notes edited by Jon Stene.

On Objectivity and Models for Measuring. G. Rasch. Lecture notes edited by Jon Stene. On Objectivity and Models for Measuring By G. Rasch Lecture notes edited by Jon Stene. On Objectivity and Models for Measuring By G. Rasch Lectures notes edited by Jon Stene. 1. The Basic Problem. Among

More information

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Daisuke Ikeda Department of Informatics, Kyushu University 744 Moto-oka, Fukuoka 819-0395, Japan. daisuke@inf.kyushu-u.ac.jp

More information

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations

Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations Quantitative Understanding in Biology Module III: Linear Difference Equations Lecture I: Introduction to Linear Difference Equations Introductory Remarks This section of the course introduces dynamic systems;

More information

The Integers. Math 3040: Spring Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers Multiplying integers 12

The Integers. Math 3040: Spring Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers Multiplying integers 12 Math 3040: Spring 2011 The Integers Contents 1. The Basic Construction 1 2. Adding integers 4 3. Ordering integers 11 4. Multiplying integers 12 Before we begin the mathematics of this section, it is worth

More information

Biological Systems: Open Access

Biological Systems: Open Access Biological Systems: Open Access Biological Systems: Open Access Liu and Zheng, 2016, 5:1 http://dx.doi.org/10.4172/2329-6577.1000153 ISSN: 2329-6577 Research Article ariant Maps to Identify Coding and

More information

Today s Lecture: HMMs

Today s Lecture: HMMs Today s Lecture: HMMs Definitions Examples Probability calculations WDAG Dynamic programming algorithms: Forward Viterbi Parameter estimation Viterbi training 1 Hidden Markov Models Probability models

More information

THE information capacity is one of the most important

THE information capacity is one of the most important 256 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 44, NO. 1, JANUARY 1998 Capacity of Two-Layer Feedforward Neural Networks with Binary Weights Chuanyi Ji, Member, IEEE, Demetri Psaltis, Senior Member,

More information

WAITING-TIME DISTRIBUTION FOR THE r th OCCURRENCE OF A COMPOUND PATTERN IN HIGHER-ORDER MARKOVIAN SEQUENCES

WAITING-TIME DISTRIBUTION FOR THE r th OCCURRENCE OF A COMPOUND PATTERN IN HIGHER-ORDER MARKOVIAN SEQUENCES WAITING-TIME DISTRIBUTION FOR THE r th OCCURRENCE OF A COMPOUND PATTERN IN HIGHER-ORDER MARKOVIAN SEQUENCES Donald E. K. Martin 1 and John A. D. Aston 2 1 Mathematics Department, Howard University, Washington,

More information

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations

A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. XX, NO. X, MONTH 2007 1 A Simple Left-to-Right Algorithm for Minimal Weight Signed Radix-r Representations James A. Muir Abstract We present a simple algorithm

More information

Hidden Markov Models and Their Applications in Biological Sequence Analysis

Hidden Markov Models and Their Applications in Biological Sequence Analysis Hidden Markov Models and Their Applications in Biological Sequence Analysis Byung-Jun Yoon Dept. of Electrical & Computer Engineering Texas A&M University, College Station, TX 77843-3128, USA Abstract

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

4452 Mathematical Modeling Lecture 16: Markov Processes

4452 Mathematical Modeling Lecture 16: Markov Processes Math Modeling Lecture 16: Markov Processes Page 1 4452 Mathematical Modeling Lecture 16: Markov Processes Introduction A stochastic model is one in which random effects are incorporated into the model.

More information

Stochastic Histories. Chapter Introduction

Stochastic Histories. Chapter Introduction Chapter 8 Stochastic Histories 8.1 Introduction Despite the fact that classical mechanics employs deterministic dynamical laws, random dynamical processes often arise in classical physics, as well as in

More information

Recent Advances in Bayesian Inference Techniques

Recent Advances in Bayesian Inference Techniques Recent Advances in Bayesian Inference Techniques Christopher M. Bishop Microsoft Research, Cambridge, U.K. research.microsoft.com/~cmbishop SIAM Conference on Data Mining, April 2004 Abstract Bayesian

More information

ECE521 Lecture 19 HMM cont. Inference in HMM

ECE521 Lecture 19 HMM cont. Inference in HMM ECE521 Lecture 19 HMM cont. Inference in HMM Outline Hidden Markov models Model definitions and notations Inference in HMMs Learning in HMMs 2 Formally, a hidden Markov model defines a generative process

More information

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems

A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems A Probabilistic Relational Model for Characterizing Situations in Dynamic Multi-Agent Systems Daniel Meyer-Delius 1, Christian Plagemann 1, Georg von Wichert 2, Wendelin Feiten 2, Gisbert Lawitzky 2, and

More information

Halting and Equivalence of Program Schemes in Models of Arbitrary Theories

Halting and Equivalence of Program Schemes in Models of Arbitrary Theories Halting and Equivalence of Program Schemes in Models of Arbitrary Theories Dexter Kozen Cornell University, Ithaca, New York 14853-7501, USA, kozen@cs.cornell.edu, http://www.cs.cornell.edu/~kozen In Honor

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg

Human-Oriented Robotics. Temporal Reasoning. Kai Arras Social Robotics Lab, University of Freiburg Temporal Reasoning Kai Arras, University of Freiburg 1 Temporal Reasoning Contents Introduction Temporal Reasoning Hidden Markov Models Linear Dynamical Systems (LDS) Kalman Filter 2 Temporal Reasoning

More information

IN this paper, we consider the capacity of sticky channels, a

IN this paper, we consider the capacity of sticky channels, a 72 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 54, NO. 1, JANUARY 2008 Capacity Bounds for Sticky Channels Michael Mitzenmacher, Member, IEEE Abstract The capacity of sticky channels, a subclass of insertion

More information

WE study the capacity of peak-power limited, single-antenna,

WE study the capacity of peak-power limited, single-antenna, 1158 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 3, MARCH 2010 Gaussian Fading Is the Worst Fading Tobias Koch, Member, IEEE, and Amos Lapidoth, Fellow, IEEE Abstract The capacity of peak-power

More information

Incompatibility Paradoxes

Incompatibility Paradoxes Chapter 22 Incompatibility Paradoxes 22.1 Simultaneous Values There is never any difficulty in supposing that a classical mechanical system possesses, at a particular instant of time, precise values of

More information

1 Introduction. Exploring a New Method for Classification with Local Time Dependence. Blakeley B. McShane 1

1 Introduction. Exploring a New Method for Classification with Local Time Dependence. Blakeley B. McShane 1 Exploring a New Method for Classification with Local Time Dependence Blakeley B. McShane 1 Abstract We have developed a sophisticated new statistical methodology which allows machine learning methods to

More information

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape UG3 Computability and Intractability (2009-2010): Note 4 4. Bells and whistles. In defining a formal model of computation we inevitably make a number of essentially arbitrary design decisions. These decisions

More information

Information Theory. Week 4 Compressing streams. Iain Murray,

Information Theory. Week 4 Compressing streams. Iain Murray, Information Theory http://www.inf.ed.ac.uk/teaching/courses/it/ Week 4 Compressing streams Iain Murray, 2014 School of Informatics, University of Edinburgh Jensen s inequality For convex functions: E[f(x)]

More information

Computing Consecutive-Type Reliabilities Non-Recursively

Computing Consecutive-Type Reliabilities Non-Recursively IEEE TRANSACTIONS ON RELIABILITY, VOL. 52, NO. 3, SEPTEMBER 2003 367 Computing Consecutive-Type Reliabilities Non-Recursively Galit Shmueli Abstract The reliability of consecutive-type systems has been

More information

Bioinformatics 2 - Lecture 4

Bioinformatics 2 - Lecture 4 Bioinformatics 2 - Lecture 4 Guido Sanguinetti School of Informatics University of Edinburgh February 14, 2011 Sequences Many data types are ordered, i.e. you can naturally say what is before and what

More information

Introduction to Hidden Markov Models (HMMs)

Introduction to Hidden Markov Models (HMMs) Introduction to Hidden Markov Models (HMMs) But first, some probability and statistics background Important Topics 1.! Random Variables and Probability 2.! Probability Distributions 3.! Parameter Estimation

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington

More information

Gene Prediction in Eukaryotes Using Machine Learning

Gene Prediction in Eukaryotes Using Machine Learning Aarhus University Master s Thesis Gene Prediction in Eukaryotes Using Machine Learning Author Lukasz Telezynski Supervisor Christian Storm Pedersen June 15, 2018 ii Abstract Gene prediction, especially

More information

arxiv:cs/ v2 [cs.it] 1 Oct 2006

arxiv:cs/ v2 [cs.it] 1 Oct 2006 A General Computation Rule for Lossy Summaries/Messages with Examples from Equalization Junli Hu, Hans-Andrea Loeliger, Justin Dauwels, and Frank Kschischang arxiv:cs/060707v [cs.it] 1 Oct 006 Abstract

More information

Hardy s Paradox. Chapter Introduction

Hardy s Paradox. Chapter Introduction Chapter 25 Hardy s Paradox 25.1 Introduction Hardy s paradox resembles the Bohm version of the Einstein-Podolsky-Rosen paradox, discussed in Chs. 23 and 24, in that it involves two correlated particles,

More information

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory

Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory Entropy and Ergodic Theory Lecture 3: The meaning of entropy in information theory 1 The intuitive meaning of entropy Modern information theory was born in Shannon s 1948 paper A Mathematical Theory of

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

Lecture 7: DecisionTrees

Lecture 7: DecisionTrees Lecture 7: DecisionTrees What are decision trees? Brief interlude on information theory Decision tree construction Overfitting avoidance Regression trees COMP-652, Lecture 7 - September 28, 2009 1 Recall:

More information

Machine Learning (Spring 2012) Principal Component Analysis

Machine Learning (Spring 2012) Principal Component Analysis 1-71 Machine Learning (Spring 1) Principal Component Analysis Yang Xu This note is partly based on Chapter 1.1 in Chris Bishop s book on PRML and the lecture slides on PCA written by Carlos Guestrin in

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models

A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models A Gentle Tutorial of the EM Algorithm and its Application to Parameter Estimation for Gaussian Mixture and Hidden Markov Models Jeff A. Bilmes (bilmes@cs.berkeley.edu) International Computer Science Institute

More information

Classification and Regression Trees

Classification and Regression Trees Classification and Regression Trees Ryan P Adams So far, we have primarily examined linear classifiers and regressors, and considered several different ways to train them When we ve found the linearity

More information

CHAPTER 2 INTRODUCTION TO CLASSICAL PROPOSITIONAL LOGIC

CHAPTER 2 INTRODUCTION TO CLASSICAL PROPOSITIONAL LOGIC CHAPTER 2 INTRODUCTION TO CLASSICAL PROPOSITIONAL LOGIC 1 Motivation and History The origins of the classical propositional logic, classical propositional calculus, as it was, and still often is called,

More information

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM

BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM BAYESIAN STRUCTURAL INFERENCE AND THE BAUM-WELCH ALGORITHM DMITRY SHEMETOV UNIVERSITY OF CALIFORNIA AT DAVIS Department of Mathematics dshemetov@ucdavis.edu Abstract. We perform a preliminary comparative

More information

Lecture 4: Training a Classifier

Lecture 4: Training a Classifier Lecture 4: Training a Classifier Roger Grosse 1 Introduction Now that we ve defined what binary classification is, let s actually train a classifier. We ll approach this problem in much the same way as

More information

CS229 Supplemental Lecture notes

CS229 Supplemental Lecture notes CS229 Supplemental Lecture notes John Duchi 1 Boosting We have seen so far how to solve classification (and other) problems when we have a data representation already chosen. We now talk about a procedure,

More information

Commentary. Regression toward the mean: a fresh look at an old story

Commentary. Regression toward the mean: a fresh look at an old story Regression toward the mean: a fresh look at an old story Back in time, when I took a statistics course from Professor G., I encountered regression toward the mean for the first time. a I did not understand

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Information in Biology

Information in Biology Information in Biology CRI - Centre de Recherches Interdisciplinaires, Paris May 2012 Information processing is an essential part of Life. Thinking about it in quantitative terms may is useful. 1 Living

More information

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence

Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras. Lecture - 13 Conditional Convergence Real Analysis Prof. S.H. Kulkarni Department of Mathematics Indian Institute of Technology, Madras Lecture - 13 Conditional Convergence Now, there are a few things that are remaining in the discussion

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Lecture notes on Turing machines

Lecture notes on Turing machines Lecture notes on Turing machines Ivano Ciardelli 1 Introduction Turing machines, introduced by Alan Turing in 1936, are one of the earliest and perhaps the best known model of computation. The importance

More information

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM I529: Machine Learning in Bioinformatics (Spring 2017) Content HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM Yuzhen Ye School of Informatics and Computing Indiana University,

More information

Information in Biology

Information in Biology Lecture 3: Information in Biology Tsvi Tlusty, tsvi@unist.ac.kr Living information is carried by molecular channels Living systems I. Self-replicating information processors Environment II. III. Evolve

More information

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization

CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization CS168: The Modern Algorithmic Toolbox Lecture #6: Regularization Tim Roughgarden & Gregory Valiant April 18, 2018 1 The Context and Intuition behind Regularization Given a dataset, and some class of models

More information