Searching genomes for non-coding RNA using FastR

Size: px
Start display at page:

Download "Searching genomes for non-coding RNA using FastR"

Transcription

1 Searching genomes for non-coding RNA using FastR Shaojie Zhang Brian Haas Eleazar Eskin Vineet Bafna Keywords: non-coding RNA, database search, filtration, riboswitch, bacterial genome. Address for correspondence: Vineet Bafna, Ph.D. Department of Computer Science and Engineering University of California at San Diego 9500 Gilman Drive La Jolla, CA Phone: (Office), (Fax) Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA Ph: (W), (F) Bioinformatics Program, Department of Bioengineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA Ph: (W) Department of Computer Science and Engineering, University of California at San Diego, 9500 Gilman Drive, La Jolla, CA Ph: (W), (F) 1

2 Abstract The discovery of novel non-coding RNAs has been among the most exciting recent developments in Biology. Yet, many more remain undiscovered. It has been hypothesized that there is in fact an abundance of functional non-coding RNA (ncrna) with various catalytic and regulatory functions. However, the inherent signal for ncrna is weaker than the signal for protein coding genes, making these harder to identify. We consider the following problem: Given an RNA sequence with a known secondary structure, efficiently compute all structural homologs (computed as a function of sequence and structure similarity) in a genomic database. Our approach, based on structural filters that eliminate a large portion of the database, while retaining the true homologs, allows us to search a typical bacterial database in minutes on a standard PC, with high sensitivity and specificity. This is two orders of magnitude better than currently available software for the problem. We applied FastR to the discovery of novel riboswitches, which are a class of RNA domains found in the UTR regions. They are of interest because they regulate metabolite synthesis by directly binding metabolites. We searched all available eubacterial and archaeal genomes for riboswitches from purine, lysine, thiamin, and riboflavin sub-families. Our results point to a number of novel candidates for each of these sub-families, and include genomes that were not known to contain riboswitches. 1 INTRODUCTION Not all genes encode proteins. Non-coding RNA (ncrna) form transcripts that are functional molecules by themselves. The involvement of ncrna in translation (trna), splicing and other cellular functions is well-known. As early as 1961, Jacob and Monod hypothesized complementary 2

3 roles for the two classes of genes, proposing that Structural genes encode proteins, and regulatory genes produce ncrna [10]. Until recently, however, most novel gene discovery was in the form of protein coding genes, and discovery of ncrna was limited to finding novel homologs of commonly occurring ncrnas (such as trna and rrna). In part, these discoveries were fuelled by advances in genomic (EST) and computational technologies as well as large scale genome sequencing projects leading up to publications of large Eukaryotic genomes [13, 26, 28]. The complete genomic sequence allowed us to refine the estimates of the number of human (coding) genes. Surprisingly, these current estimates ( genes) comprise less than 2% of the genome, far lower than earlier estimates, and only twice as many as in Drosophila. It is an intriguing question if these genes and their (alternatively spliced) protein products are sufficient to carry out complex cellular functions. Could it be that many cellular functions are carried out by as yet undiscovered, ncrna? Recent discoveries show that this idea of a treasure trove of undiscovered ncrna is not without merit. The discovery of endogenous small interfering RNA (RNAi) has generated a lot of excitement (See, for example, [19]). Targeted search for other non-coding RNA (ncrna) [1, 15, 16] has led to surprising discoveries of novel sub-families of ncrna. Some ncrna are not independently transcribed but occur as part of the untranslated regions of mrna. For example, Riboswitches are ncrna elements that often occur in the 5 Untranslated Region (UTR) regions and regulate the transcription of the downstream gene by directly binding to metabolites [18, 27]. It is hypothesized that there is in fact an abundance of undiscovered, functional ncrna with various catalytic and regulatory functions (The Modern RNA world [7]). The reason these genes remain undiscovered is because genomic and computational tools for finding ncrna are not as advanced as those for protein coding genes. Various computational approaches to detecting non-coding genes are under investigation. 3

4 Some of these are attempts at de novo prediction, looking for signals that might suggest a functional RNA in the molecule. The most promising approach seemed to be the use of secondary structure as a signal [3, 14] to discover RNA. This approach builds upon extensive earlier research into predicting the secondary structure of an RNA molecule [31, 11]. However, recent reports [20, 30] have concluded that the secondary structure signal is not sufficient to detect ncrna, and random sequences with a biased GC composition, or with a di-nucleotide composition similar to true RNA sequences, usually allow folding into energetically favorable secondary structures. Other de novo approaches include looking for the transcription start and similar signals, but have had limited success. The consensus is that the ncrna signals in a genome are not as strong as the signals for protein coding genes. Therefore, a natural approach to the problem is based on comparative methods. One approach is to consider the evidence for RNA structure in sequences that are conserved through evolution [21, 5, 4]. However, if the sequences have diverged, it is not easy to find candidates. Other methods seek to identify ncrna by comparing structure and sequence similarity [17]. However, these are usually tailored to finding homologs of a specific RNA, such as trna. Recently, Klein and Eddy [12] developed a tool, RSEARCH, for searching a database with a general ncrna query. The method depends upon existing algorithms for computing alignments between an RNA sequence and substrings of a database, where the alignment score is a function of sequence and structural similarity. Known algorithms for computing such alignments are computationally intensive (approximately O(mw 2 n), where m is the length of the query sequence, n is the length of the database sequence, and w is the maximum length of a database substring that is aligned to the query). For a test run on an Intel/Linux PC with 2.8GHz, 1Gb memory, a microbial database of size 1.67M, and a query 5S rrna sequence, RSEARCH took over 6.5 hrs. to run. This makes it intractable for large genomic database. 4

5 In this paper, we describe FastR, an efficient database search tool for ncrna. As an example, FastR reduces the compute time of the previously mentioned query to 103s. Are such algorithmic improvements worth investigating? An analogy can be made with the BLAST algorithm, which has had tremendous influence on the growth of sequence databases such as Genbank, and bioinformatics as a discipline. While tools for sequence alignment, based on the Smith-Waterman algorithm, had been available for a long time, BLAST changed the landscape largely by its speed and accuracy in searching for sequence homologs. Our proposed research has the potential to greatly accelerate the discovery of non-coding genes. The idea of pre-filtering the database for ncrna has also been explored recently. Weinberg and Ruzzo, 2004, described filters based on Markov models, which can provably retain all hits that a covariance model could find. In contrast, our filters are based on secondary structure conservation. Therefore, they allow compensatory mutations that might reduce primary sequence similarity. Our results compare favorably against their tool (see supplementary data bchaas/fastr/). We applied FastR to the discovery of novel riboswitches, which are a class of RNA domains found in the UTR regions. They are of interest because they regulate metabolite synthesis by directly binding metabolites. We searched all available eubacterial and archaeal genomes (508Mb) for riboswitches from purine, lysine, thiamin, and riboflavin sub-families. Our results point to a number of novel candidates for each of these sub-families, and include genomes that were not previously known to contain riboswitches. As an example, a search with the purine riboswitch (Z / ) took 19 hrs on a standard PC and resulted in the discovery of 180 homologs, including 33 of 35 known riboswitches. 9 of these are of interest as they lie less than 500bp upstream of a gene involved in Purine metabolism. Thus, FastR is a viable tool for discovering novel homologs of ncrna. We describe details of the FastR algorithm in the Methods section. In the Results and 5

6 Discussion section, the algorithm is validated by testing its speed and accuracy on known ncrna sub-families. Finally, we describe our findings from searches of riboswitches. 2 METHODS 2.1 Filters Key to our efficient searching for homologs is the notion of filters for RNA structure that allow a large fraction of the database to be discarded while retaining most of the true homologs. The obvious question is whether sequence similarity with the query string is sufficient to get an initial set of candidate regions. To test this, we queried the whole genome of A. pernix (GenBank NC ) with an Asn-tRNA sequence. With default parameters, BLASTN selected 4 hits with an E-value < 0.001, and 24 hits with E-value < of the 4, and 10 of the 24 matched the 43 hits produced by RSEARCH. Most of the alignments were less than 20bp in length and would have been discarded. Another example is presented in Figure 1. The alignment of two trna sequences (Acc#: X / and AF / ), from Drosophila, shows complete conservation of structure, but low sequence similarity. From this, and similar tests not included here, we do not anticipate a tool based on sequence similarity to be effective in finding RNA homologs. Therefore, we turn to the secondary structure of the query RNA sequence as the basis for our filter design. We will continue to use sequence similarity in computing the final alignments. As shown in Fig. 2, the secondary structure of an RNA has a tree like shape and can be decomposed into various loops (Interior loops, bulges, multi-loops) and stack regions. Each stem in this tree contains energetically favorable stacked base-pairs. The stacks are stabilized by hy- 6

7 GCAUCGGUGGUUCAGUGGUAGAAUGCUCGCCUGCCACGCGGGCG <<<<<<<..<<<<...>>>>.<<<<<...>>>>>.. UCUAAUAUGGCAGAUU...AGUGCAAUAGAUUUAAGCUCUAUAU GCCCGGGUUCGAUUCCCGACCGAUGCA..<<<<<...>>>>>>>>>>>>. AUAAAGU.AUUUU.ACUUUUAUUAGAA Figure 1: Alignment of two trna sequences from Drosophila melanogaster (trna Gly, Acc#: X /115-45) (top) and Drosophila simulans (trna Leu, Acc#: AF / ) (bottom). The two molecules have identical secondary structure (there are 4 stacks, and two same-colored blocks form a stack), but very low sequence similarity (only 4 bases are matched in stacked region). Note that these are diverged members of a large super-family. However, they underscore the need for structure based alignments. bulge stack a b A C C G G U G G U C a multi loop g hairpin loop e e b d c c g d f f h Internal loop h pseudo knot a a b b c d e e d f h f c g h g k w <= w Figure 2: (a) An RNA structure with various structural elements including stacked base-pairs, bulges, hairpin, and multi-loops. (b) An alternative view. The set of bases in (a, a ) forms a (k, w)-stack. drogen bonds between base-pairs. The Watson-Crick base-pairing (A U, C G) is energetically the most favorable, but other pairings such as the wobble base-pair (G U) are possible as well. Figure 2(b) provides a stretched view of the RNA structure. Each stack corresponds to a pair of sub-strings. These pairs are typically non-interleaving. While interleaved stacks, or pseudo-knots (such as the pair (f, f ), and (h, h ) in Figure 2) do occur, they can be ignored for filtering purposes. Consider a nucleotide string s, with s = n. We define a (k,w)-stack as a pair of indices (i, j), i < j if (j i) w, and s[i... i + k 1], and s[j... j + k 1] can form an energetically 7

8 favorable base-pair stack. As an example, the indices of the substring a, a ) in Figure 2 form a (5, w)-stack if they are at most w bases apart. A simple filter choice for an RNA structure is the set of all starting positions i which contain a (k, w)-stack for appropriately chosen k and w. Let p be the probability that a pair of randomly chosen bases is part of a stack. The probability that a pair of indices (i, j) with (j i) w forms a (k, w)-stack is p k. Define X i,j as the indicator variable with X i,j = 1 if and only if (i, j) forms a (k, w)-stack. Using linearity of expectation, the expected number of hits in a random string of length n is E( n i+w i=1 j=i+k X i,j ) = n i+w i=1 j=i+k E(X i,j ) nwp k See Table 1 for the expected number of hits per starting position ( wp k ). Obviously, for large k and small w, even this simple filter can be quite powerful. Assume for exposition purposes that the base-pairing is limited to the Watson-Crick base (A U, C G), and the wobble base-pair (G U). For a randomly and uniformly chosen pair of bases, the probability p of pairing is p = 3. 8 As an example, typical trna structures have a clover-leaf shape with the outermost stem having a 7 base-pair stack separated by about 70 bases. The (7, 70) filter would eliminate over 90% of the starting positions from consideration. In fact, we can do better as this base-pair unit is in fact separated by at least 50 bases in all trna, therefore making w effectively 20 (50 w 70), eliminating 98% of the starting positions. 2.2 Filter Design We will use the (k, w)-stack as the basis for our filter design. However, we need to design more sophisticated filters as indels may sometimes disrupt base-pair stacks (decreasing the effective value of k), and variability in separation may increase the effective value of w. We quantify 8

9 w\k Table 1: Expected number of hits in a random string in a (k, w)-filter. some design goals for filters to evaluate different designs, spur further research in this area. A good filter must be efficient. The time to filter should be no more than the time to align and score the filtered hits, and preferably as small as possible. Additionally, the filters must have high sensitivity, and specificity. Sensitivity is described as the fraction of all members of the ncrna family that is admitted by the filter, and should be as close to 1 as possible. It may be acceptable to work with lower sensitivities, for example, to look for members in a sub-family. We define specificity as the expected number of hits per base-pair, and should be as small as possible. Finally, the filters must be general, and simply described, so as to be applicable (with appropriate parameter tuning) to every ncrna family. We propose the following Nested and Multiloop filters: Nested filters: Considering the RNA secondary structure as a tree, and going depth first down a path (see Fig. 2), we have many nested (k, w)-stacks. The term (k, w, l)-nested stack is used to define l (k, w)-stacks in a nested configuration. For example, in Fig. 2, the configuration (a, a ), (c, c ), (d, d ), (e, e ), is a (k, w, 4)-nested stack. A (k, w, l)-nested stack may be overlapping if the different (k, w)-stacks are allowed to overlap, and non-overlapping otherwise. Overlapping nested stacks are another useful way of dealing with indels in the homolog structure. 9

10 Parallel and Multiloop stacks: Yet another way of looking at RNA structural elements is to locate non-nested, non-overlapping (k, w)-stacks. We define a (k, w, l) parallel stack as a configuration of l non-nested, non-overlapping (k, w)-stacks. This definition can be extended to a multiloop stack. A (k, w, l)-multiloop stack is a configuration in which a (k, w, l 1)-parallel stack is enclosed by a (k, w)-stack. The units (b, b ), (d, d ), (f, f ), and (g, g ) in Fig. 2 form a (k, w, 4)-parallel stack. Correspondingly, {(a, a ), (b, b ), (d, d ), (f, f ), (g, g )} is a (k, w, 5)-multiloop stack. The nested, parallel, and multiloop stacks are all generalizations of the (k, w)-stack, and therefore applicable to all families of ncrna. There are conserved structural elements in every ncrna family that enforce the correct folding, so it should be possible to find multiloop and nested structures with high sensitivity. Also, the simple description allows us to compute specificity using combinatorial techniques. To increase the specificity of these filters, we need to extend the design to include distance constraints (number of base-pairs) in between the various (k, w)stacks. For a filter with l (k, w) stacks, there are 2l substrings of length k each with 2l 1 distances between adjacent substrings. To this, we add an additional distance between the first and the last substring, and we have a vector of 2l distances. We constrain the distances by a 2l-dimensional vector w containing the allowed ranges for each of these distances. Choose w 0 to be the range of distances between the first and last substring, and w j, j > 1 to be the range of distances in the substrings ordered from left to right. A (multiloop/nested) filter satisfying these constraints is a (k, w, l)-filter. Note that (k, w, l)-multiloop stack can be redefined by choosing w such that w j = (0, w) for all j. A (4, w, 4)-multiloop stack for trna with appropriate distance constraints is shown in Fig

11 Figure 3: A (k, w, 4)-multiloop stack for trna with distance constraints, with w = [(50, 75), (3, 7), (3, 15), (0, 7), (5, 15), (0, 7), (5, 15), (3, 7)] Specificity and Sensitivity: To compute the specificity of a (multiloop or nested) filter, we address the following combinatorial problem: What is the probability that an arbitrary position in the random database is the start of a (k, w, l)-multiloop stack, or nested stack? In general, this is hard to compute because of the various dependencies between overlapping units, so we approach it indirectly. Consider a 2l dimensional vector v. If the distances in v are within the range specified by w, then v denotes a configuration of a (k, w, l)-multiloop stack obtained by fixing the 2l positions of the l (k, w)-stacks using distances in v. The probability of occurrence of an arbitrary configuration is exactly p kl. For an arbitrary starting position, and a configuration v, define an indicator variable X v = 1 if a (k, w, l)-multiloop stack occurs with configuration v 0 Otherwise Let Y = v X v. We are interested in computing P r[y > 0]. By linearity of expectation, E(Y ) = v E(X v ) = n k, w,l p kl, where n k, w,l is number of possible configurations of a (k, w, l)- multiloop stack. n k, w,l can be computed using standard combinatorial arguments. We consider two special cases: 11

12 1. Let 0 w j w for all j. Then, ( ) w 2(k 1)l 1 n k, we,l = 2l 1 2. Let 0 w 0, and for all j > 0, let 0 w j < x. Then, n k, w,l = x 2l 1. Ideally, we choose the distance constraints so that n k, w,l p kl << 1. For those values, we can use the Markov inequality (P r[y > 0] < E(Y )) to get the desired bound. For higher values of E(Y ), we need other techniques to bound the probability. These computations allow us to quantify the sensitivity-specificity trade-off due to a change in the distance constraints. Further increases in specificity are obtained by using intersections of nested and multiloop stacks. In Section 3, we describe our filtering results on various test cases. Informally, making a filter restrictive increases specificity at the cost of sensitivity. However, in most families of interest, we can design effective filters that reduce the database size by two orders of magnitude. 2.3 Optimal Filter Design Given a family R of ncrna sequences, an ideal (nested or multiloop) filter would seek to minimize n k, w,l p kl (increase specificity) while admitting a large fraction of the members (sensitivity) and allow efficient filtering. Initial tests on the purine filters resulted in a ten fold improvement in total running time with no loss of sensitivity. We will describe our results on optimal filter design elsewhere. As the input to FastR is a single query ncrna, we employ a dynamic programming algorithm that automatically generates Nested and Multiloop filters with high specificity. The algorithm takes advantage of the tree like structure of RNA. It iterates over every value of k, l. For each such pair of values, and every node v in the tree, it checks if a (k, l)-nested (multiloop) filter is 12

13 possible. The final filter chosen is one that maximizes kl, while keeping k as low as possible. The software then allows users to tweak the computed parameters to get the desired sensitivity while retaining specificity. However, our results test results show that the automatically generated filters have sensitivity that is comparable to the fine tuned filters. 2.4 Filtering Algorithms Filtering speed is critical to fast homolog computation. We use a combination of string matching and dynamic programming techniques (See for example [9]) to filter databases with multiloop and nested filters. 1. Hash: Build a hash table to compute all k-mer positions in the database. The time taken is O(m), where m is the size of the database. 2. Identify (k, w)-stacks: Let s i denote the k-mer at an arbitrary position i in the database. For each s i, compute a neighborhood N(s i ) of all complementary k-mers. To identify (k, w)-stacks efficiently, we use the hash table to compute all positions j such that s j N(s i ), and j i satisfies distance constraints. With appropriate data structures, the time taken is linear in the number of (k, w)-stacks, which is typically smaller than the size of the database. 3. Filters: Note that multiloop and nested filters are combinations of (k, w)-stacks. We scan the database with a moving window of size w. An active list of (k, w)-stacks within the window is maintained, and a dynamic programming technique is used to compute filters from this list. The total computation is bounded by O(m k w), where m k is the number of (k, w)-stacks. Typically m k < m. w 13

14 In general, any k-mer that can form an energetically favorable stack with s should be in N(s k ). In our current implementation, we do not allow indels, and allow at most 2 G U pairs. To test this, a scan of all of the structures in Rfam 5.0 [8] showed that at least 93% of all stacks contain an ungapped base-pairing of at size at least 4. Note that even the absence of an ungapped stack does not preclude the formation of a filter using other stacks in the same molecule. Therefore, this is a reasonable choice that does not affect sensitivity too much. The current filter time is a few seconds per Mb of sequence, which is easily dominated by the time for computing alignments. Also, the filters are very effective in eliminating a large fraction of the database while retaining most of the true hits. 2.5 Computing RNA Sequence Alignment The filtered database substrings must be structurally aligned to the query to identify true homologs. We focus on both, correctness (distinguishing true hits from false positives), and on efficiency. The problem has been well studied in the literature, with scoring based on a Nussinov like counting model [2, 24], and probabilistic models such as Covariance Models and Stochastic Context free grammars for RNA [6, 23]. It is also possible to extend the Zuker-Turner thermodynamic model [11, 31] for scoring sequence structure alignments, but that has not been yet been done. Here, we extend the approach from Bafna et al., 1995, to include a new binarizing procedure, banded alignment for efficient computation, and more realistic score functions. We use the scoring matrix (RIBOSUM) from Klein and Eddy [12] and empirically generated affine gap penalties to score the alignments. We note that our filtering approach generates candidates which can be used in conjunction with any alignment method. However, we use the extra information from the filter match to speed up alignment computation using banding techniques. 14

15 Similar to sequence alignment, an alignment A of two RNA strings s and t (with lengths m and n respectively) can be described by a matrix of two rows. The first row A[1, ] contains the string s with gaps, and the second row A[2, ] contains the string t with some interspersed gaps. Each column has at most one gap in it. We score for both sequence and structure using the function δ(a[1, i], A[1, j], A[2, i], A[2, j]). In this function A[1, i], A[1, j] (likewise A[2, i], A[2, j]) are scored for complementarity (conserved structure), and the pairs (A[1, i], A[2, i]) and (A[1, j], A[2, j]) are scored for sequence conservation. Additionally, we score each column that does not participate in base-pairing, by a function γ(a, b) that measures sequence conservation, for all a, b {A, C, G, U, }. Alignments are scored by summing up the contributions of sequence and structural alignments. A schematic algorithm for aligning an RNA query against a sequence is given in Fig. 5. Note that this algorithm uses linear gap penalties. In our implementation, we use a slightly more sophisticated affine gap function (omitted in Figure 5 for exposition). Our alignment is local in the subject sequence (there is no penalty for aligning ends of the sequence), but global in the query sequence (the entire query must be aligned). A naive algorithm would iterate over all pairs of intervals in s, and t. We can do better by exploiting the structure of s. The structure of an RNA sequence s is a set S of base-pairs, where (i, j) S implies that s[i] bonds with s[j]. Ignoring pseudo-knots, each base-pair has a unique enclosing base-pair, thus S can be shown to be a tree with each node denoting a base-pair, and the obvious parent-child relation. First, we augment the tree (See algorithm and an illustration in Fig. 4) by adding spurious base-pairs so that each nucleotide (originally base-paired or not) is in some base-pair, each node has at most two children, and the number of nodes is O(m), where s = m. Additionally, each node v S has at most one child in the augmented structure, denoted by S. 15

16 a b a b c c i f e d g j (a) h k i e d f g j (b) h k procedurebinarize(i,j) (* Binarize the interval (i, j). *) if (i = j) return (create node(i,j,dotted,nil)); (* A dotted node with 0 child. *) if (i, j) S v = Binarize(i+1,j-1); return (create node(i,j,solid,v)); (* A solid node with 1 child v. *) if (k, j) S for some i < k < j vl = Binarize(1,k-1); vr = Binarize(k,j); return (create node(i,j,dotted,vl,vr)); (*A dotted node with 2 children, vl and vr. *) if (i < j) v = Binarize(i,j-1); return (create node(i,j,dotted,v)); (* A dotted node with 1 child v. *) end if (c) Figure 4: Procedure to create a Binary tree with O(m) nodes such that each node has at most 2 children. (a) The solid edges correspond to pairs in S, while the dotted edges correspond to augmented spurious edges. (b) A binary tree representation for (a), by changing solid edges into solid nodes and dotted edges into void nodes. (c) The Binarize procedure. 16

17 procedure alignrna (*S is the set of base-pairs in RNA structure of s. S is the augmented set. *) for all intervals (i, j), 1 i < j n, all nodes v S if v S A[i + 1, j 1, child(v)] + δ(t[i], t[j], s[l v ], s[r v ]), A[i, j 1, v] + γ(, t[j]), A[i + 1, j, v] + γ( A[i, j, v] = max, t[i]), A[i + 1, j, child[v]] + γ(s[l v ], t[i]) + γ(s[r v ], ), A[i, j 1, child[v]] + γ(s[l v ], ) + γ(s[r v ], t[j]), A[i, j, child[v]] + γ(s[l v ], ) + γ(s[r v ], ), else if v S S, and v has one child A[i, j, v] = max A[i, j 1, child[v]] + γ(s[r v ], t[j]), A[i, j, child[v]] + γ(s[r v ], ), A[i, j 1, v] + γ(, t[j]), A[i + 1, j, v] + γ(, t[i]), else if v S S, and v has two children A[i, j, v] = max i k j {A[i, k 1, left child[v]] + A[k, j, right child[v]]} end if end for Figure 5: An algorithm for aligning a query RNA s of length m with a database string t of length n. The query structure S has been Binarized to get S. The index pair in s corresponding to each node v S is denoted by (l v, r v ). We limit the intervals in s to nodes v S, which are bounded by O(m). Figure 5 describes the algorithm for aligning sequence t against sequence s (with known structure). Each node v in the tree structure of s is aligned against each interval (i, j) of t. Suppose v S and let l v and r v denote the indices of the left and right end-points of v. If, for example, s[l v ] = t[i] and s[r v ] = t[j], then clearly A[i, j, v] = A[i + 1, j 1, child(v)] + δ(t[i], t[j], s[l v ], s[r v ]) If, on the other hand, v S S, and has 2 children, then we need to iterate over all k such that the right child(v) can align with the interval (k, j). The procedure alignrna (Figure 5) describes the dynamic programming algorithm to handle all the cases. Let m 1, and m 2 = m m 1 denote the number of nodes with 1 and 2 children respectively. The complexity of alignrna, with a query of length m and a target of length n, is O(n 2 m 1 + n 3 m 2 ). This parameterization is useful 17

18 because in typical structures m 2 << m. In our case, the complexity can be further reduced. The sequence pairs that need to be aligned have been filtered for an underlying sub-structure. The preliminary alignment obtained by this filter allows us to limit the nodes in S that can be applied to a position i in t, based on the left end-point of v, and the width. This banding reduces the number of nodes to a constant, effectively making the complexity O(n 2 δm), 2 where δ m << m is the size of the banded region. The banding forces a tradeoff. Overlapping hits from the filter can either be aligned independently with a tight band, or merged and aligned once with larger band size. 2.6 P-value For an effective database search, we need to have p-values for the probability that a hit was obtained by chance. Klein and Eddy make the argument that the distribution of scores of RNA structural alignments follow the Gumbel distribution. As this is a strong assumption, and determination of a true p-value is a challenging research problem. Therefore, we choose to express the p-value by using the non-parametric Chebyshev s inequality. To obtain the mean and variance, the query is aligned against randomly generated sequence with a similar GC content as the database after each query. The bound provided by this inequality is conservative, and over-estimates the probability of obtaining a similar score by chance. We have found that a cut-off of 0.03 is a reasonable value in practice. 3 RESULTS We describe the results on filtering and alignment independently before giving combined results. To test our algorithms, we worked with ncrna sub-families of known/predicted structure from 18

19 the Rfam [8] and the 5S Ribosomal RNA database [25]. Four sub-families are considered here, trna, 5S rrna, the hammerhead ribozyme, and 4 riboswitches, purine, lysine, thiamin, and riboflavin [27, 29]. Of these, trna and rrna are well-studied sub-families. Most genome annotations include screening and annotation for trna. The different riboswitches are of great interest because they regulate metabolite (nucleic-acids, amino-acids, vitamins) synthesis by direct binding to metabolites. In subsequent tests, we search the entire complement of eubacterial and archaeal genomes for novel riboswitches. For every sub-family, we chose representative members, inserted them in a random database of size 1Mb, and tested our algorithms on the composite sequence. The probability of finding stacks at random depends on the GC content, so in some cases, the random database was created by first choosing the GC-content, and subsequently generating bases with appropriate fixed probability. G + C probabilities of 0.35, 0.5, and 0.75 were chosen to study the effect of GC-content. All experiments were performed on an Intel PC (3.4GHz, 1Gb RAM), running Linux. 3.1 Filtering for ncrna Table 2 describes results of applying various filters. As expected, as the filters become more stringent (higher k, l, less variable distances), the number of false negatives increases. However, for each family, there exist appropriate filters that filter out a large portion of the database while retaining most of the members of the family. Also, as the GC-content is biased away from 0.5, the number of false hits increases. The false negatives are all explained by one of 3 possibilities: (a) The proposed structure contains non-canonical base-pairs, which are not allowed by FastR. For example, 10/100 trna sequences contain A G, A A, or A C base-pairs. (b) One of the (k, w) stacks is missing due to indels, mismatches or short stacks. (c) Distance constraints are not 19

20 ncrna GC k l #Hits (/Mb) True Pos. /Tot. Non canonical pairing Missing (k, w)-stack Deviant w trna / trna / trna / S rrna / S rrna / Hammerhead 0.5 4(*) / Purine-Rs / Thiamin-Rs 0.5 4(*) / Lysine-Rs / Riboflavin-Rs 0.5 3(*) / Table 2: The results of applying nested and multiloop filters (with various parameters) to random databases that contain true positives. As the filters become more stringent, the number of hits decrease, and the number of false-negatives increase. In all but (*) cases, at most 2 G U basepairs are allowed in a stack. (*) refers to cases where only 1 G U base-pair is allowed by the multiloop filter. The false negatives are due to non-canonical base-pairing, small stacks (k = 3), or the distance constraints being out of range, as described in the last 3 columns. satisfied. In ongoing work, we plan to change the neighborhood computation to include all k-mers that form low energy stacks according to the Zuker-Turner [11] thermodynamic considerations. With respect to varying distance constraints, one has to choose the correct speed/sensitivity trade-off. The filters we have selected are all fast, and lead to very few hits in the random database. This number will increase as the distance constraints are increased. 3.2 Alignment To test alignment quality, we computed alignments on a randomly generated 300Kb database sequence with set of ncrna sequence for each family inserted in it. No filtering was used for FastR. Figure 6 shows ROC plots for the two alignment algorithms. The two are comparable, 20

21 Query Hits (TP/Tot) Filtered Hits Time RSEARCH 85/ s FastR Asn-tRNA (AE / ) 77/ s RSEARCH 97/ s FastR 5S rrna (AE / ) 80/ s RSEARCH 50/ s FastR Hammerhead (M /56-3) 47/ s RSEARCH 34/ s FastR Purine-Rs (Z / ) 33/ s RSEARCH 32/ s FastR Lysine-Rs (Z / ) 28/ s RSEARCH 109/ s FastR Thiamin-Rs (Z / ) 71/ s RSEARCH 41/ s FastR Riboflavin-Rs (L / ) 31/ s Table 3: Comparison of FastR and RSEARCH. A p-value cutoff for FastR, 0.05, was chosen for FastR that approximately matched the total number of hits in RSEARCH with cutoff E-value of 10. The hits column refers to the number of true positives out of the total hits found. The filtered hits column represents the number of true positives passed the filters. No filtering is used for RSEARCH. with RSEARCH performance better for distant homologs. The time taken for FastR (banded) trna alignment is 3m48s, compared to 20m42s for RSEARCH. Finally, we evaluate FastR after combining filtering and alignment. We randomly select the query for each family is from Rfam seed alignment and search the random sequences using FastR and RSEARCH. Table 3 summarizes the results of our search. Similar results are achieved when repeating the tests with different queries. As can be seen, FastR is close to two orders of magnitude faster than RSEARCH while maintaining comparable sensitivity. Note that there is little or no change in sensitivity due to banding. Much of the loss of sensitivity is due to filtering. 21

22 #TP RSEARCH FastR #TP RSEARCH FastR #FP trna #FP 5S rrna #TP RSEARCH FastR #TP RSEARCH FastR #FP Hammerhead #FP Purine-Rs #TP RSEARCH FastR #TP RSEARCH FastR #FP Lysine-Rs #FP Thiamin-Rs #TP #FP Riboflavin-Rs RSEARCH FastR Figure 6: ROC plots for the alignments generated by RSEARCH and FastR. Alignments were tested using a 300kb random sequence with a set of true ncrnas inserted in it. The X-axis represents the number of false-positives and the y-axis represents the number of true positives. The horizontal line represents the number of true hits in the random sequences. 22

23 As seen in the previous section, FastR alignments and scores are good for the high quality hits, but decrease thereafter leading to a loss of sensitivity. As expected, much of the loss of sensitivity can be attributed to filtering. For 5s rrna, the filter allows 80 of the 100 true positives, which are almost completely retrieved by FastR. In contrast, RSEARCH gets the top 97 but needs two orders of magnitude more time, making it much harder to conduct large scale searches. It should also be pointed out that many of the true positives were initially discovered using covariance models which are not unlike the model used by RSEARCH. As a final validation of the FastR algorithm, we apply it to the discovery of novel members of Riboswitches. Our results point to a number of interesting findings. 3.3 Riboswitches Riboswitches are cis-regulatory elements typically found in the 5 untranslated region of the gene they regulate. To date, six such motifs have been identified that control the anabolism of three vitamins (riboflavin, thiamin, cobalamin), as well as the biosynthesis of methionine, lysine and purine [18, 22, 27, 29]. Similar to previously characterized RNA regulatory structures, each riboswitch is capable of folding into a consensus structure which may result in either transcription attenuation or translation inhibition. However, the riboswitch element is unique in that it binds directly to ligands and is therefore able to sense the level of cellular metabolites without the need of trans-acting protein factors. It is believed that this class of ncrna appeared early in evolution, and accordingly, riboswitch elements have been found in a wide range of bacterial species. The vitamin riboswitches are the most diverse and can be identified in archaea and eubacteria. In particular, the thiamin riboswitch has been characterized in fungi and plants such as rice and arabidopsis [27]. Conversely, 23

24 the methionine, lysine, and purine riboswitches are more common in gram-positive bacteria. The repression mechanism is also biased by a bacterium s phylogeny. Gram positive bacteria typically prefer transcription termination; whereas, gram-negative micro-organisms tend to mediate gene repression by inhibiting translation. While riboswitches are ubiquitous, homologs show little sequence similarity. Even in the most conserved regions, typically for ligand binding, the sequence identity may be less than 7 nucleotides. We used FastR to search the entire complement of bacterial and archaeal genomes with queries from purine, thiamin, lysine and riboflavin riboswitches. A data set of non-redundant, known riboswitches existing within our genome files was assembled from the Rfam database. This data set was used to determine the p-value cutoffs, and a single member used as the query sequence. A total of 245 genomes comprising 508 Mb were searched in both strands. Candidate riboswitch sequences generated by FastR were filtered in order to find the best predictions. First, known riboswitches from the Rfam database, low-complexity and AT-rich predictions were discarded. The remaining predictions were filtered by their distance from the 5 start of an exon. Finally, the predictions were manually examined to determine if the downstream gene was biologically relevant. The results are summarized in Table 4 (for complete results and analysis see supplementary data bchaas/fastr/). We focus on the 18 most promising hits, even though the remaining hits are likely to contain many interesting candidates. See Table 5. These predictions represent either those elements which are upstream from genes involved in the metabolic pathway under regulation, or predictions with strong sequence similarity in regions thought to mediate ligand binding. 6 of the 9 putative purine riboswitches are found in the 5 UTR of either the xanthine transport protein, xanthine phosphoribosyltransferase, purine nucleotide phosphorylase, adenine deaminase, or GMP synthase. Moreover, prediction 2 (gi ) lies upstream from a hypothetical protein 24

25 Riboswitch Query Time hrs.(secs./mb) P value cutoff a Novel hits b Filtered hits c Purine Z / (67.7) Lysine Z / (162.23) Thiamin Z / (148.71) Riboflavin L / (80.94) Table 4: Summary of the FastR riboswitch search. (a) The p-value cutoffs were determined from alignments of known riboswitches. (b) Number of novel hits returned by FastR after removing the annotated hits in Rfam database. (c) Number of novel hits after removing annotated hits in Rfam database and filtering for low-complexity, AT-content, and distance from a gene. with homology to the xanthine permease family. This observation highlights a hidden value in identification of riboswitches - the possibility of assigning annotations to genes of unknown function. Similarly, of the 7 reported lysine riboswitch predictions, 5 are located upstream of genes encoding an amino acid permease, diaminopimelate decarboxylase, dihydrodipicolinate synthase, or lysine specific permease. The final predictions for the riboflavin and thiamin riboswitches are found upstream of genes encoding diaminohydroxyphosphoribosylaminopyrimidine deaminase and phosphomethylpyrimidine kinase, respectively. Of the 16 novel purine and lysine riboswitch predictions, 13 are from gram-positive bacteria, supporting earlier conclusions. 4 of the 7 novel purine hits were to Lactobacillus johnsonii and Lactobacillus plantarum, which have no previously identified purine riboswitches. Likewise, 4 lysine predictions and 1 riboflavin prediction are from genomes with no previous riboswitches from that family. While none of these predictions appears in Rfam, it has been brought to our attention that some of these predictions overlap with the predictions in [22]. As these were made using completely different techniques, they provide additional validation of our approach. Free energy minimization approaches to secondary structure prediction are not well suited to 25

26 Riboswitch Genome Location a p-value D b Gene Annotation Bacillus anthracis (+) GMP synthase Lactobacillus johnsonii* (+) Hypothetical protein (xanthine permease family) Lactobacillus plantarum* (+) Xanthine / uracil transport protein Lactobacillus plantarum* ( ) Adenine deaminase Purine Lactobacillus johnsonii* ( ) Xanthine phosphoribosyltransferase Bacillus anthracis (+) Conserved hypothetical protein Clostridium perfringens (+) Purine nucleoside phosphorylase Bdellovibrio bacteriovorus ( ) Hypothetical protein Bacillus cereus ( ) Adenine deaminase Bacillus subtilis ( ) ABC transporter (amino acid permease) Bacillus halodurans (+) Diaminopimelate decarboxylase Fusobacterium nucleatum* ( ) Hypothetical protein Lysine Onion yellows phytoplasma* ( ) ABC-type amino acid transport system Lactococcus lactis* (+) Lysine specific permease Lactococcus lactis* ( ) Dihydrodipicolinate synthase Shewanella oneidensis ( ) Hypothetical Na+/H+ antiporte Riboflavin Thermus thermophilus* ( ) Diaminohydroxyphosphoribosylaminopyrimidine deaminase family Thiamin Streptococcus pneumonia ( ) Phosphomethylpyrimidine kinase Table 5: Description of the 18 most promising candidates from the 468 putative riboswitches discovered by FastR. Each predicted riboswitch is either upstream from a biologically relevant gene, or contains strong sequence similarity in regions thought to mediate ligand binding. (a) Genome coordinates of the riboswitches. + and refer to the strand. (b) Distance between the start of the riboswitch and the 5 end of an exon. (*) Genomes with no previously identified riboswitches from that family. 26

27 riboswitches, because the repressing structure is contingent upon ligand binding. FastR offers an advantage for such RNA motifs in that the biologically significant structure can be inferred from the alignment. The secondary structures derived from the top predictions in each riboswitch family in Table 5 can be seen in Figure 6. 4 DISCUSSION Our results show that FastR is an effective tool for finding novel homologs of query ncrna sequences. In general, the development of fast filtering and searching tools for ncrna is a natural area of research, analogous to the development of sequence similarity tools like BLAST, and Fasta. However, as the discussion above shows, the underlying structure and diversity of ncrna makes this problem quite different in character. Consequently, the filters must be more complex than the (approximate) keyword matches used for sequence similarity. The ideas presented here open many lines of research, which we are actively pursuing. The first is regarding sensitivity. For diverged families, the filters miss out a few true homologs. Our analysis showed that in many cases, this was due to a stem loop not being recognized as a (k, w)-stack. This can be due to too few base-pairs, bulges, and non-canonical base-pairing. However, the stem must still have low-energy that allows it to maintain that conformation. Therefore, we plan to generalize the definition of a (k, w)-stack allowing all pairs that form energetically favorable structures. While non-canonical base-pairs are easy to handle, bulges and interior loops are computationally more challenging. It will be interesting to see how these changes affect sensitivity. Some homologs are filtered out because they do not satisfy distance constraints. Relaxing the distance constraints could decrease specificity. One approach to increasing sensitivity without compromising specificity is to relax the distance constraints, but 27

28 (a) (b) (c) (d) Figure 7: Representative riboswitch secondary structures derived from the alignments of the top novel hits for each query. (a) The secondary structure prediction for the top purine hit. (b) The secondary structure prediction for the top lysine hit. (c) The secondary structure prediction for the top thiamine hit. (d) The secondary structure prediction for the top riboflavin hit. 28

29 employ multiple nested and multiloop filters. While keyword matches are not good filters, Weinberg and Ruzzo, 2004, show that filters based on Markov Models can be very effective. These capture conservation in sequence, but not structure. However, they are constructed from covariance models in a way that ensures the same level of sensitivity as the CM. It will be interesting to combine the two filters to see how well they perform. Another direction is the design of optimal multiloop and nested filters. Currently, the filters were constructed by changing parameters empirically. We are working to automate the design of optimal (multi-loop and nested) filters for an ncrna family. Preliminary results on the purine riboswitch show a 10-fold speedup with no loss of sensitivity, and we are testing the methodology on other families. These filters are constructed when the secondary structure is known for all members of the ncrna family. In general, the secondary structure might not be known, and it would be interesting to automate the filter design after inferring common structural elements. This is a simpler and more tractable version of the well-studied problem of RNA multiple alignment because it only requires an alignment of the stacks involved in the filters. In many cases, the FastR results themselves need to be filtered to remove obvious false positives. However, the advantage of using the tool is that good candidates can be found with relatively little effort. If the Modern RNA world hypothesis is true, many ncrna sequences will be discovered in the coming years. Our tool can be used to rapidly identify novel homologs of these ncrna. Finally, many RNA motifs, including riboswitches, fold into the correct structure only in combination with other molecules. Programs that predict structure based on de novo energy minimization are challenged in their ability to find the correct structure for these molecules. In contrast, comparative tools such as ours can be used to infer structure, relatively easily. Similar to most Bioinformatics tools, the results of the search must be validated in the lab for 29

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri RNA Structure Prediction Secondary

More information

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006

98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 98 Algorithms in Bioinformatics I, WS 06, ZBIT, D. Huson, December 6, 2006 8.3.1 Simple energy minimization Maximizing the number of base pairs as described above does not lead to good structure predictions.

More information

In Genomes, Two Types of Genes

In Genomes, Two Types of Genes In Genomes, Two Types of Genes Protein-coding: [Start codon] [codon 1] [codon 2] [ ] [Stop codon] + DNA codons translated to amino acids to form a protein Non-coding RNAs (NcRNAs) No consistent patterns

More information

A Structure-Based Flexible Search Method for Motifs in RNA

A Structure-Based Flexible Search Method for Motifs in RNA JOURNAL OF COMPUTATIONAL BIOLOGY Volume 14, Number 7, 2007 Mary Ann Liebert, Inc. Pp. 908 926 DOI: 10.1089/cmb.2007.0061 A Structure-Based Flexible Search Method for Motifs in RNA ISANA VEKSLER-LUBLINSKY,

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

Chapter 16 Lecture. Concepts Of Genetics. Tenth Edition. Regulation of Gene Expression in Prokaryotes

Chapter 16 Lecture. Concepts Of Genetics. Tenth Edition. Regulation of Gene Expression in Prokaryotes Chapter 16 Lecture Concepts Of Genetics Tenth Edition Regulation of Gene Expression in Prokaryotes Chapter Contents 16.1 Prokaryotes Regulate Gene Expression in Response to Environmental Conditions 16.2

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

proteins are the basic building blocks and active players in the cell, and

proteins are the basic building blocks and active players in the cell, and 12 RN Secondary Structure Sources for this lecture: R. Durbin, S. Eddy,. Krogh und. Mitchison, Biological sequence analysis, ambridge, 1998 J. Setubal & J. Meidanis, Introduction to computational molecular

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: m Eukaryotic mrna processing Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: Cap structure a modified guanine base is added to the 5 end. Poly-A tail

More information

Introduction to Evolutionary Concepts

Introduction to Evolutionary Concepts Introduction to Evolutionary Concepts and VMD/MultiSeq - Part I Zaida (Zan) Luthey-Schulten Dept. Chemistry, Beckman Institute, Biophysics, Institute of Genomics Biology, & Physics NIH Workshop 2009 VMD/MultiSeq

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

Predicting RNA Secondary Structure

Predicting RNA Secondary Structure 7.91 / 7.36 / BE.490 Lecture #6 Mar. 11, 2004 Predicting RNA Secondary Structure Chris Burge Review of Markov Models & DNA Evolution CpG Island HMM The Viterbi Algorithm Real World HMMs Markov Models for

More information

Name Period The Control of Gene Expression in Prokaryotes Notes

Name Period The Control of Gene Expression in Prokaryotes Notes Bacterial DNA contains genes that encode for many different proteins (enzymes) so that many processes have the ability to occur -not all processes are carried out at any one time -what allows expression

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology"

RNA Search and! Motif Discovery Genome 541! Intro to Computational! Molecular Biology RNA Search and! Motif Discovery" Genome 541! Intro to Computational! Molecular Biology" Day 1" Many biologically interesting roles for RNA" RNA secondary structure prediction" 3 4 Approaches to Structure

More information

Genome 559 Wi RNA Function, Search, Discovery

Genome 559 Wi RNA Function, Search, Discovery Genome 559 Wi 2009 RN Function, Search, Discovery The Message Cells make lots of RN noncoding RN Functionally important, functionally diverse Structurally complex New tools required alignment, discovery,

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell.

Types of RNA. 1. Messenger RNA(mRNA): 1. Represents only 5% of the total RNA in the cell. RNAs L.Os. Know the different types of RNA & their relative concentration Know the structure of each RNA Understand their functions Know their locations in the cell Understand the differences between prokaryotic

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Genomics and bioinformatics summary. Finding genes -- computer searches

Genomics and bioinformatics summary. Finding genes -- computer searches Genomics and bioinformatics summary 1. Gene finding: computer searches, cdnas, ESTs, 2. Microarrays 3. Use BLAST to find homologous sequences 4. Multiple sequence alignments (MSAs) 5. Trees quantify sequence

More information

From Gene to Protein

From Gene to Protein From Gene to Protein Gene Expression Process by which DNA directs the synthesis of a protein 2 stages transcription translation All organisms One gene one protein 1. Transcription of DNA Gene Composed

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

More information

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable

RNA Basics. RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U. Bases can only pair with one other base. wobble pairing. 23 Hydrogen Bonds more stable RNA STRUCTURE RNA Basics RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U wobble pairing Bases can only pair with one other base. 23 Hydrogen Bonds more stable RNA Basics transfer RNA (trna) messenger

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid. 1. A change that makes a polypeptide defective has been discovered in its amino acid sequence. The normal and defective amino acid sequences are shown below. Researchers are attempting to reproduce the

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters

Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters Computational Approaches for determination of Most Probable RNA Secondary Structure Using Different Thermodynamics Parameters 1 Binod Kumar, Assistant Professor, Computer Sc. Dept, ISTAR, Vallabh Vidyanagar,

More information

RNA Synthesis and Processing

RNA Synthesis and Processing RNA Synthesis and Processing Introduction Regulation of gene expression allows cells to adapt to environmental changes and is responsible for the distinct activities of the differentiated cell types that

More information

GCD3033:Cell Biology. Transcription

GCD3033:Cell Biology. Transcription Transcription Transcription: DNA to RNA A) production of complementary strand of DNA B) RNA types C) transcription start/stop signals D) Initiation of eukaryotic gene expression E) transcription factors

More information

Searching for Noncoding RNA

Searching for Noncoding RNA Searching for Noncoding RN Larry Ruzzo omputer Science & Engineering enome Sciences niversity of Washington http://www.cs.washington.edu/homes/ruzzo Bio 2006, Seattle, 8/4/2006 1 Outline Noncoding RN Why

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

BME 5742 Biosystems Modeling and Control

BME 5742 Biosystems Modeling and Control BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various

More information

V19 Metabolic Networks - Overview

V19 Metabolic Networks - Overview V19 Metabolic Networks - Overview There exist different levels of computational methods for describing metabolic networks: - stoichiometry/kinetics of classical biochemical pathways (glycolysis, TCA cycle,...

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

V14 extreme pathways

V14 extreme pathways V14 extreme pathways A torch is directed at an open door and shines into a dark room... What area is lighted? Instead of marking all lighted points individually, it would be sufficient to characterize

More information

Combinatorial approaches to RNA folding Part I: Basics

Combinatorial approaches to RNA folding Part I: Basics Combinatorial approaches to RNA folding Part I: Basics Matthew Macauley Department of Mathematical Sciences Clemson University http://www.math.clemson.edu/~macaule/ Math 4500, Spring 2015 M. Macauley (Clemson)

More information

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p.110-114 Arrangement of information in DNA----- requirements for RNA Common arrangement of protein-coding genes in prokaryotes=

More information

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON PROKARYOTE GENES: E. COLI LAC OPERON CHAPTER 13 CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON Figure 1. Electron micrograph of growing E. coli. Some show the constriction at the location where daughter

More information

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega BLAST Multiple Sequence Alignments: Clustal Omega What does basic BLAST do (e.g. what is input sequence and how does BLAST look for matches?) Susan Parrish McDaniel College Multiple Sequence Alignments

More information

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007 -2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007 About PASA PASA is an open

More information

Computational Methods For Analyzing Rna Folding Landscapes And Its Applications

Computational Methods For Analyzing Rna Folding Landscapes And Its Applications University of Central Florida Electronic Theses and Dissertations Doctoral Dissertation (Open Access) Computational Methods For Analyzing Rna Folding Landscapes And Its Applications 2012 Yuan Li University

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

RNA secondary structure prediction. Farhat Habib

RNA secondary structure prediction. Farhat Habib RNA secondary structure prediction Farhat Habib RNA RNA is similar to DNA chemically. It is usually only a single strand. T(hyamine) is replaced by U(racil) Some forms of RNA can form secondary structures

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Towards a Comprehensive Annotation of Structured RNAs in Drosophila

Towards a Comprehensive Annotation of Structured RNAs in Drosophila Towards a Comprehensive Annotation of Structured RNAs in Drosophila Rebecca Kirsch 31st TBI Winterseminar, Bled 20/02/2016 Studying Non-Coding RNAs in Drosophila Why Drosophila? especially for novel molecules

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Introduction to Bioinformatics

Introduction to Bioinformatics CSCI8980: Applied Machine Learning in Computational Biology Introduction to Bioinformatics Rui Kuang Department of Computer Science and Engineering University of Minnesota kuang@cs.umn.edu History of Bioinformatics

More information

Detecting non-coding RNA in Genomic Sequences

Detecting non-coding RNA in Genomic Sequences Detecting non-coding RNA in Genomic Sequences I. Overview of ncrnas II. What s specific about RNA detection? III. Looking for known RNAs IV. Looking for unknown RNAs Daniel Gautheret INSERM ERM 206 & Université

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information

Gene regulation II Biochemistry 302. February 27, 2006

Gene regulation II Biochemistry 302. February 27, 2006 Gene regulation II Biochemistry 302 February 27, 2006 Molecular basis of inhibition of RNAP by Lac repressor 35 promoter site 10 promoter site CRP/DNA complex 60 Lewis, M. et al. (1996) Science 271:1247

More information

Supplemental Materials

Supplemental Materials JOURNAL OF MICROBIOLOGY & BIOLOGY EDUCATION, May 2013, p. 107-109 DOI: http://dx.doi.org/10.1128/jmbe.v14i1.496 Supplemental Materials for Engaging Students in a Bioinformatics Activity to Introduce Gene

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications 1 GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications 2 DNA Promoter Gene A Gene B Termination Signal Transcription

More information

Regulation of Gene Expression at the level of Transcription

Regulation of Gene Expression at the level of Transcription Regulation of Gene Expression at the level of Transcription (examples are mostly bacterial) Diarmaid Hughes ICM/Microbiology VT2009 Regulation of Gene Expression at the level of Transcription (examples

More information

A Method for Aligning RNA Secondary Structures

A Method for Aligning RNA Secondary Structures Method for ligning RN Secondary Structures Jason T. L. Wang New Jersey Institute of Technology J Liu, JTL Wang, J Hu and B Tian, BM Bioinformatics, 2005 1 Outline Introduction Structural alignment of RN

More information

Sequence Analysis '17- lecture 8. Multiple sequence alignment

Sequence Analysis '17- lecture 8. Multiple sequence alignment Sequence Analysis '17- lecture 8 Multiple sequence alignment Ex5 explanation How many random database search scores have e-values 10? (Answer: 10!) Why? e-value of x = m*p(s x), where m is the database

More information

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species Schedule Bioinformatics and Computational Biology: History and Biological Background (JH) 0.0 he Parsimony criterion GKN.0 Stochastic Models of Sequence Evolution GKN 7.0 he Likelihood criterion GKN 0.0

More information

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17

RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17 RNA-Strukturvorhersage Strukturelle Bioinformatik WS16/17 Dr. Stefan Simm, 01.11.2016 simm@bio.uni-frankfurt.de RNA secondary structures a. hairpin loop b. stem c. bulge loop d. interior loop e. multi

More information

Regulation of Gene Expression

Regulation of Gene Expression Chapter 18 Regulation of Gene Expression Edited by Shawn Lester PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley

More information

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained

Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained Copyright (c) 2007 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending an email to pubs-permissions@ieee.org.

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

Comparative Bioinformatics Midterm II Fall 2004

Comparative Bioinformatics Midterm II Fall 2004 Comparative Bioinformatics Midterm II Fall 2004 Objective Answer, part I: For each of the following, select the single best answer or completion of the phrase. (3 points each) 1. Deinococcus radiodurans

More information

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype Lecture Series 7 From DNA to Protein: Genotype to Phenotype Reading Assignments Read Chapter 7 From DNA to Protein A. Genes and the Synthesis of Polypeptides Genes are made up of DNA and are expressed

More information

Chapter 17 The Mechanism of Translation I: Initiation

Chapter 17 The Mechanism of Translation I: Initiation Chapter 17 The Mechanism of Translation I: Initiation Focus only on experiments discussed in class. Completely skip Figure 17.36 Read pg 521-527 up to the sentence that begins "In 1969, Joan Steitz..."

More information

Predicting Protein Functions and Domain Interactions from Protein Interactions

Predicting Protein Functions and Domain Interactions from Protein Interactions Predicting Protein Functions and Domain Interactions from Protein Interactions Fengzhu Sun, PhD Center for Computational and Experimental Genomics University of Southern California Outline High-throughput

More information

Whole Genome Alignment. Adam Phillippy University of Maryland, Fall 2012

Whole Genome Alignment. Adam Phillippy University of Maryland, Fall 2012 Whole Genome Alignment Adam Phillippy University of Maryland, Fall 2012 Motivation cancergenome.nih.gov Breast cancer karyotypes www.path.cam.ac.uk Goal of whole-genome alignment } For two genomes, A and

More information

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Daisuke Ikeda Department of Informatics, Kyushu University 744 Moto-oka, Fukuoka 819-0395, Japan. daisuke@inf.kyushu-u.ac.jp

More information

GEP Annotation Report

GEP Annotation Report GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name:

More information

Revisiting the Central Dogma The role of Small RNA in Bacteria

Revisiting the Central Dogma The role of Small RNA in Bacteria Graduate Student Seminar Revisiting the Central Dogma The role of Small RNA in Bacteria The Chinese University of Hong Kong Supervisor : Prof. Margaret Ip Faculty of Medicine Student : Helen Ma (PhD student)

More information

DNA/RNA Structure Prediction

DNA/RNA Structure Prediction C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Master Course DNA/Protein Structurefunction Analysis and Prediction Lecture 12 DNA/RNA Structure Prediction Epigenectics Epigenomics:

More information

Gene regulation I Biochemistry 302. Bob Kelm February 25, 2005

Gene regulation I Biochemistry 302. Bob Kelm February 25, 2005 Gene regulation I Biochemistry 302 Bob Kelm February 25, 2005 Principles of gene regulation (cellular versus molecular level) Extracellular signals Chemical (e.g. hormones, growth factors) Environmental

More information

Comparing whole genomes

Comparing whole genomes BioNumerics Tutorial: Comparing whole genomes 1 Aim The Chromosome Comparison window in BioNumerics has been designed for large-scale comparison of sequences of unlimited length. In this tutorial you will

More information

Eukaryotic vs. Prokaryotic genes

Eukaryotic vs. Prokaryotic genes BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 18: Eukaryotic genes http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Eukaryotic vs. Prokaryotic genes Like in prokaryotes,

More information

Chapter 12. Genes: Expression and Regulation

Chapter 12. Genes: Expression and Regulation Chapter 12 Genes: Expression and Regulation 1 DNA Transcription or RNA Synthesis produces three types of RNA trna carries amino acids during protein synthesis rrna component of ribosomes mrna directs protein

More information

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004

CSE : Computational Issues in Molecular Biology. Lecture 6. Spring 2004 CSE 397-497: Computational Issues in Molecular Biology Lecture 6 Spring 2004-1 - Topics for today Based on premise that algorithms we've studied are too slow: Faster method for global comparison when sequences

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

+ regulation. ribosomes

+ regulation. ribosomes central dogma + regulation rpl DNA tsx rrna trna mrna ribosomes tsl ribosomal proteins structural proteins transporters enzymes srna regulators RNAp DNAp tsx initiation control by transcription factors

More information

Introduction to Molecular and Cell Biology

Introduction to Molecular and Cell Biology Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the molecular basis of disease? What

More information

CSCE555 Bioinformatics. Protein Function Annotation

CSCE555 Bioinformatics. Protein Function Annotation CSCE555 Bioinformatics Protein Function Annotation Why we need to do function annotation? Fig from: Network-based prediction of protein function. Molecular Systems Biology 3:88. 2007 What s function? The

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 07: profile Hidden Markov Model http://bibiserv.techfak.uni-bielefeld.de/sadr2/databasesearch/hmmer/profilehmm.gif Slides adapted from Dr. Shaojie Zhang

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

The wonderful world of RNA informatics

The wonderful world of RNA informatics December 9, 2012 Course Goals Familiarize you with the challenges involved in RNA informatics. Introduce commonly used tools, and provide an intuition for how they work. Give you the background and confidence

More information

UNIVERSITY OF YORK. BA, BSc, and MSc Degree Examinations Department : BIOLOGY. Title of Exam: Molecular microbiology

UNIVERSITY OF YORK. BA, BSc, and MSc Degree Examinations Department : BIOLOGY. Title of Exam: Molecular microbiology Examination Candidate Number: Desk Number: UNIVERSITY OF YORK BA, BSc, and MSc Degree Examinations 2017-8 Department : BIOLOGY Title of Exam: Molecular microbiology Time Allowed: 1 hour 30 minutes Marking

More information

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting. Genome Annotation Bioinformatics and Computational Biology Genome Annotation Frank Oliver Glöckner 1 Genome Analysis Roadmap Genome sequencing Assembly Gene prediction Protein targeting trna prediction

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology 2012 Univ. 1301 Aguilera Lecture Introduction to Molecular and Cell Biology Molecular biology seeks to understand the physical and chemical basis of life. and helps us answer the following? What is the

More information