Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Daisuke Ikeda
Department of Informatics, Kyushu University
744 Moto-oka, Fukuoka, Japan
daisuke@inf.kyushu-u.ac.jp

Abstract

This paper considers the mining of infrequent patterns from biological sequences. The two typical approaches to finding infrequent patterns are model-driven and data-driven, and each has advantages and disadvantages. As a mixed approach, FPCS (Finding Peculiar Composite Strings) was proposed in the literature: two substrings x and y are determined by the given data, and their concatenation xy is evaluated in a model-driven manner. Although its effectiveness has already been shown, it requires a background set of sequences in addition to the target set. In this paper, we propose another approach to infrequent patterns which, given a single set of sequences, finds string patterns composed of two substrings that are frequent in the set. The proposed approach is therefore simpler than FPCS. Its effectiveness is evaluated using biological features, such as RNA, of popular bacterial DNA sequences. For B. subtilis and C. perfringens, the proposed approach finds RNA regions as well as FPCS does, while it fails to do so for E. coli and S. enterica because FPCS is more finely granular than the proposed approach.

Keywords: Under-represented patterns, Infrequent patterns, Text mining, Bioinformatics

1. Introduction

With biological sequences now plentiful, it is becoming much more important to develop mining algorithms for them. Among such algorithms, those for frequent or infrequent patterns, called over-represented or under-represented patterns, respectively, have attracted attention in bioinformatics [2]. Compared with frequent patterns, infrequent ones are more difficult to mine from biological sequences because of data sparseness.
That is, the sparseness gives rise to a huge number of infrequent subsequences, and thus it is critical to select useful patterns from this huge set of infrequent candidates.

Existing methods for infrequent pattern mining follow two basic approaches. One is model-driven: a probabilistic model is assumed to define what is normal, and a candidate pattern is under- or over-represented if its frequency is far from the expected frequency estimated from the model. The other is data-driven: given two sets of sequences, an algorithm outputs a pattern if it appears frequently in one of the sets but rarely in the other.

A typical example of the former is the z-score [2]. The score for a pattern w is defined as

$$z(w) = \frac{f(w) - E(w)}{N(w)},$$

where $f(w)$ is the frequency of w in a given set of sequences, $E(w)$ is its expected value under an assumed probabilistic model, and $N(w)$ is a normalization factor for w. The Bernoulli model is assumed in [3], [4], and the Markov model is considered in [5], [6]. However, a simple model cannot describe the given sequences well, while a complicated one requires huge computational resources, so it is difficult to decide on an appropriate model in advance.

A typical example of the latter is the contrast or distinguishing pattern. Here a background set is assumed, in addition to a target set, to define what is normal [7], [8]. However, this approach is basically meant for frequent patterns: it eliminates useless candidate patterns that are frequent in both sets.

In [9], the algorithm called FPCS (Finding Peculiar Compositions) was proposed. Given a target set T and a background set B of sequences, it extracts a pattern w of the form w = xy if each of x and y is more frequent in B than in T and, conversely, w = xy is more frequent in T than in B.
More precisely, given two parameters $\theta_T, \theta_B$ ($\ge 1$), a candidate pattern w = xy is extracted if

$$P(x \mid B) > \theta_B P(x \mid T), \quad P(y \mid B) > \theta_B P(y \mid T), \quad \text{and} \quad P(xy \mid T) > \theta_T P(xy \mid B),$$

where $P(w \mid S)$ denotes the empirical probability of w in S. This means that we estimate the probability of xy by $\hat{P}(xy) = P(x \mid B) P(y \mid B)$ and, if the observed probability $P(xy \mid T)$ is larger than this estimate, we say that xy is quite unusual in T. In this framework, probabilities are estimated as in the z-score with a probabilistic model, but the unit of words, such as x and y, is defined automatically from the given background set. In this sense, FPCS is a mixture of the model-driven and data-driven approaches.
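As a minimal sketch, the three inequalities above can be checked as follows, assuming the six empirical probabilities have already been computed. The function name and argument names are illustrative only and do not come from [9].

```python
def is_peculiar_composition(p_x_T, p_x_B, p_y_T, p_y_B, p_xy_T, p_xy_B,
                            theta_T=2.0, theta_B=2.0):
    """Return True if w = xy passes the three FPCS inequalities.

    p_x_T stands for P(x | T), p_x_B for P(x | B), and so on.
    The default thresholds are example values, not recommendations.
    """
    # x and y must each be markedly more frequent in the background B ...
    x_prefers_background = p_x_B > theta_B * p_x_T
    y_prefers_background = p_y_B > theta_B * p_y_T
    # ... while their composition xy is markedly more frequent in the target T.
    xy_prefers_target = p_xy_T > theta_T * p_xy_B
    return x_prefers_background and y_prefers_background and xy_prefers_target


# Made-up probabilities, for illustration only.
print(is_peculiar_composition(0.001, 0.004, 0.002, 0.006, 0.0005, 0.0001))
```

With these made-up inputs all three inequalities hold, so the call prints True; lowering P(xy | T) to 0.00015 makes the third inequality fail.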

In [9], it is shown that, given bacterial sequences as the target and background sets, many of the peculiar compositions found are exceptional by the z-score criterion, while some are not. This implies that FPCS can find peculiar compositions that cannot be found by the z-score. In [10], it is shown that many peculiar compositions are found in biological features, such as rRNA and transposase, using DNA sequences of 7 popular bacteria, such as Escherichia coli K-12 (E. coli) and Bacillus subtilis (B. subtilis).

Although the effectiveness of FPCS has been shown, it requires a background set of sequences in addition to the target set. Of course, it seems natural in bioinformatics to compare the target sequences with some other sequences. However, an infrequent pattern mining algorithm that works on a single set of sequences would be much more useful.

In this paper, we propose another approach to under-represented patterns, inspired by FPCS. The proposed method takes a single set of sequences and outputs infrequent patterns of the form w = xy, where x and y are frequent in the input set and $P(xy)$ is much larger than its estimated value $P(x)P(y)$. The effectiveness of the proposed method is evaluated using biological features, such as RNA, of popular bacterial DNA sequences.

2. Finding Peculiar Compositions

Following [9], [10], we briefly explain the peculiar composition discovery problem and its significance for DNA sequences of popular bacteria.

2.1 Problem Definition

Let $\Sigma$ be an alphabet; an element of $\Sigma$ is called a letter. For nucleotide sequences, $\Sigma = \{A, C, G, T\}$. The set of all finite sequences of one or more letters is denoted by $\Sigma^+$, and an element of $\Sigma^+$ is called a string. The length of a string x, denoted by $|x|$, is the number of letters in x. Consider a string $x = a_1 \cdots a_n$ ($a_i \in \Sigma$). The letter $a_i$ in x is denoted by $x[i]$, and $a_i a_{i+1} \cdots a_j$ ($i < j$) is called a substring of x (see footnote 1).
(Footnote 1: In this paper, we do not consider the empty string, that is, the case i = j.)

For two strings $x, y \in \Sigma^+$, the concatenation of x and y is denoted by xy. We call xy the composition of x and y. Conversely, a pair of strings (x, y) is called a division of w if w = xy. There exist $O(|w|)$ divisions of w. For instance, if x = AAC and y = GC then xy = AACGC, and (A, ACGC), (AA, CGC), ..., (AACG, C) are the divisions of AACGC.

Let $x, y \in \Sigma^+$. An occurrence of x in y is an integer i such that $x[j] = y[i + j]$ for $1 \le j \le |x|$. The frequency of x in y is the number of occurrences of x in y. We extend this notion from a single string y to a set D of strings as follows: $f_D(x)$ denotes the sum of the frequencies of x in all strings in D. Since the frequency is affected by the absolute size of D, we introduce the empirical probability of x in D as the relative frequency $P(x \mid D) = f_D(x) / \#D$, where $\#D$ is the sum of frequencies.

We define the set of positions of x in y as

$$\mathrm{Pos}_y(x) = \{\, i + j \mid x[j] = y[i + j],\ 1 \le j \le |x| \,\}.$$

For example, $\mathrm{Pos}_{babbab}(ab) = \{2, 3, 5, 6\}$. In other words, $\mathrm{Pos}_y(x)$ is the set of all positions of y covered by x. It extends naturally to a set D of substrings of y by $\mathrm{Pos}_y(D) = \bigcup_{x \in D} \mathrm{Pos}_y(x)$. Note that a position is counted only once even if two substrings x and x' share it, since $\mathrm{Pos}_y(D)$ is defined as a set.

The peculiar composition discovery problem is defined as follows.

Definition 1: The peculiar composition discovery problem is, given two sets T and B of strings and threshold values $\theta_T > 1$, $\theta_B > 1$ and $\eta \ge 2$, to find all peculiar compositions of the form xy such that $P(x \mid B) > \theta_B P(x \mid T)$, $P(y \mid B) > \theta_B P(y \mid T)$, $P(xy \mid T) > \theta_T P(xy \mid B)$, and $f_T(xy) \ge \eta$.

From the first two conditions, the frequencies of both x and y are much larger in B than in T. Therefore, we would expect the composition xy to appear more frequently in B than in T. By the third condition, however, a peculiar composition xy appears more frequently in T than in B.
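The notions of Section 2.1 (divisions, occurrences, frequency, and covered positions) can be sketched in Python as follows. This is only an illustration with hypothetical function names and naive scanning; it is not taken from [9] or [10].

```python
def divisions(w):
    """All pairs (x, y) of non-empty strings with w == x + y."""
    return [(w[:k], w[k:]) for k in range(1, len(w))]


def occurrences(x, y):
    """Offsets i such that x[j] == y[i + j] for 1 <= j <= |x| (letters 1-based).

    Offset i in the text's 1-based notation is index i in Python's
    0-based slicing, so y[i:i + len(x)] is exactly the compared window.
    """
    return [i for i in range(len(y) - len(x) + 1) if y[i:i + len(x)] == x]


def frequency(x, D):
    """f_D(x): total number of (possibly overlapping) occurrences of x in D."""
    return sum(len(occurrences(x, s)) for s in D)


def positions(x, y):
    """Pos_y(x): 1-based positions of y covered by some occurrence of x."""
    return {i + j for i in occurrences(x, y) for j in range(1, len(x) + 1)}


print(divisions("AACGC"))        # the four divisions listed in the text
print(positions("ab", "babbab")) # the set {2, 3, 5, 6} from the text's example
```

Because `positions` returns a set, positions shared by overlapping occurrences are counted once, matching the remark on $\mathrm{Pos}_y(D)$.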
In this sense, a peculiar composition xy is exceptional.

2.2 Peculiar Compositions in Biological Sequences

Fig. 1 shows the peculiar compositions found on a genetic map of the whole target sequence of B. subtilis, from 1 bp at the top left to 4,214,630 bp at the bottom right, where E. coli is used as the background sequence. A map contains two tracks. The upper track is for biological features: a feature is displayed above (resp. below) the track line if it is on the normal strand (resp. its complement). The lower track is for the peculiar compositions found; they are drawn on both strands because if a peculiar composition is found on the normal strand then its corresponding composition is also found at the corresponding position of the complement, and vice versa.

As biological features, Fig. 1 includes rRNA colored blue, tRNA light blue, and other RNA-related features navy blue, where we say that a feature is RNA-related if its feature key includes RNA as a substring. A gene or CDS whose function, product, or note record contains transposon or phage is colored green or yellow, respectively. We exclude other genes and CDSs from the map because it is known that genes prevail in bacterial DNA sequences.

Fig. 1: A genetic map of the whole DNA sequence of B. subtilis with two tracks, where rRNA, tRNA, other RNAs, transposon, and phage are colored blue, light blue, navy blue, green, and yellow, respectively, on the upper track, and the peculiar compositions found are colored red on the lower track.

We see that the peculiar compositions found with $\theta_T = 2.0$, $\theta_B = 2.0$ and $\eta = 3$ appear densely at biological features, especially RNAs. It is known that RNAs are well preserved among species, and that transposons and phages are of external origin, and thus we can say that the peculiar compositions found are useful.

Fig. 2 shows three enlarged maps of B. subtilis from 1 bp to around 1 Mbp, where the value of $\eta$ is varied. From these maps, we see that the peculiar compositions found appear densely at biological features even if we use a larger value for the parameter. In [10], it is also shown that patterns extracted by the z-score, and contrast patterns appearing infrequently in the target sequence, do not match biological features well.

The peculiar compositions in Table 1 are found in B. subtilis, where E. coli is used as the background set, $\theta_T = \theta_B = 2.0$ and $\eta = 10$. They are found in rRNA (rrnO-16S).

3. Peculiar Compositions of Frequent Substrings

In this section, we introduce the peculiar composition of frequent substrings discovery problem. First of all, we assume a single set of sequences. In the peculiar composition discovery problem, we used T and B to define frequent x and y. Now, however, we do not have B, and thus we define frequent x and y by a minimum support threshold. Once we obtain frequent x and y, all we have to do is find xy whose probability is much larger than its estimated value, $P(x)P(y)$. The peculiar composition of frequent substrings discovery problem is stated formally as Definition 2.
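The idea described in this section can be sketched as a naive enumeration. The sketch below rests on several assumptions that are not in the paper: substrings are enumerated only up to a hypothetical `max_len`, the normalizer of the empirical probability is taken to be the total number of letters in T, and the thresholds are named after those of Definition 2. It is not the author's algorithm, which the paper does not describe at this level.

```python
from collections import Counter


def count_substrings(T, max_len):
    """Count overlapping occurrences of every substring of length <= max_len."""
    counts = Counter()
    for s in T:
        for i in range(len(s)):
            for k in range(1, min(max_len, len(s) - i) + 1):
                counts[s[i:i + k]] += 1
    return counts


def mine_compositions(T, theta, minsup_f, minsup_xy, max_len=8):
    """Return pairs (x, y) such that x and y are frequent in T and
    P(xy | T) > theta * P(x | T) * P(y | T), with f_T(xy) >= minsup_xy."""
    counts = count_substrings(T, max_len)
    n = sum(len(s) for s in T)  # ASSUMED normalizer: total number of letters

    def prob(w):
        return counts[w] / n

    found = []
    for w, f_w in counts.items():
        if len(w) < 2 or f_w < minsup_xy:
            continue
        for k in range(1, len(w)):  # try every division (x, y) of w
            x, y = w[:k], w[k:]
            if (counts[x] >= minsup_f and counts[y] >= minsup_f
                    and prob(w) > theta * prob(x) * prob(y)):
                found.append((x, y))
    return found


# Toy input: "AC" repeated ten times.
print(mine_compositions(["AC" * 10], theta=1.9, minsup_f=5, minsup_xy=5, max_len=2))
# → [('A', 'C')]
```

In the toy input, AC occurs 10 times in 20 letters, so P(AC) = 0.5 exceeds 1.9 × 0.5 × 0.5 = 0.475, whereas CA occurs 9 times and P(CA) = 0.45 does not. Enumerating all substrings is quadratic in `max_len` per position; a whole-genome run would need an index structure such as a suffix tree or suffix array.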
Definition 2: The peculiar composition of frequent substrings discovery problem is, given a set T of strings and threshold values $\theta > 1$, $minsup_f \ge 2$ and $minsup_{xy} \ge 2$, to find all peculiar compositions of frequent substrings of the form xy such that $f_T(x) \ge minsup_f$, $f_T(y) \ge minsup_f$, $P(xy \mid T) > \theta P(x \mid T) P(y \mid T)$, and $f_T(xy) \ge minsup_{xy}$.

By the first two conditions, both x and y must be frequent, and by the third, the probability of xy must be larger than its expected value $P(x \mid T) P(y \mid T)$. The last condition sets the minimum support of the patterns found.

4. Experiments

In this section, after describing the data sets and the evaluation method, we present both qualitative and quantitative evaluations of the proposed method.

4.1 Setting

The data sets used in our experiments are the whole DNA sequences of four bacteria, E. coli, B. subtilis, Clostridium perfringens (C. perfringens for short), and Salmonella enterica (S. enterica for short), whose statistics are shown in Table 2. We chose E. coli and B. subtilis since they are typical model bacteria with different properties: the former is Gram-negative while the latter is Gram-positive. We chose C. perfringens (resp. S. enterica) since it is Gram-positive (resp. Gram-negative).

Fig. 2: Three maps for $\eta$ = 3, 5 and 10 from left to right, where the target set, the background set, and the other parameters $\theta_T$, $\theta_B$ are fixed to B. subtilis, E. coli, 2 and 2, respectively.

Table 1: Some peculiar compositions found in B. subtilis, where E. coli is used as the background set.
(x, y) | (f_T(xy), f_B(xy)) | (f_T(x), f_B(x)) | (f_T(y), f_B(y))
(AACGCTGG, CGGCGTG) | (9, 1) | (92, 389) | (399, 1217)
(ACGCTG, GCGGCGT) | (9, 3) | (1532, 4008) | (584, 1838)
(CGCTGGCG, GCGTG) | (9, 2) | (104, 895) | (4009, 11412)
(CGCTG, GCGGCGT) | (10, 6) | (6718, 17434) | (584, 1838)

Fig. 3: A gene map of B. subtilis from 1 bp to around 20 Kbp. Intervals labelled in the figure: (9807, 11362), (11461, 11538), (11549, 11625), (11706, 14634), (14689, 14808).

For qualitative evaluation, we use genetic maps as shown above, and for quantitative evaluation we calculate the F-measure, an evaluation value popular in information retrieval, which is defined as

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R},$$

where P and R denote precision and recall, respectively. The F-measure typically means $F_1$, which weights precision and recall equally, but we choose $\beta = 1/4$. Our goal is not to find these features but to show that the peculiar compositions found match biological features. Although the patterns found seem to fully cover RNAs (see Fig. 1), they are sparse in the enlarged map of Fig. 3, where only about 20 Kbp are shown. From the viewpoint of our goal, we do not need high recall since we are trying to find useful, infrequent patterns, and we cannot expect infrequent patterns to cover all occurrences of a feature. Thus, we choose $\beta = 1/4$, which weighs precision four times as much as recall.
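For concreteness, the $F_\beta$ formula above can be written as a small function. This is a sketch; the function name and the convention of returning 0 when both precision and recall are 0 are assumptions, not part of the paper.

```python
def f_beta(precision, recall, beta=0.25):
    """F-measure with weight beta; beta = 1/4 emphasizes precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # assumed convention to avoid division by zero
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Illustrative values only: high precision, low recall.
print(f_beta(0.8, 0.2))            # beta = 1/4
print(f_beta(0.8, 0.2, beta=1.0))  # the usual F_1
```

For precision 0.8 and recall 0.2, $F_{1/4}$ is about 0.68 while $F_1$ is about 0.32, which shows how the choice $\beta = 1/4$ rewards patterns that are precise even when they cover few feature positions.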
4.2 Genetic Maps

For qualitative evaluation, we show genetic maps like Fig. 1. First, we show a map obtained from B. subtilis with parameter values $\theta$ = […], $minsup_f$ = 1000, and $minsup_{xy}$ = 20 (see Fig. 4). We see that dense regions of peculiar compositions of frequent substrings match the blue regions, that is, rRNA and tRNA. Next, we show a map obtained from E. coli with parameter values $\theta$ = […], $minsup_f$ = 3000, and $minsup_{xy}$ = 20 (see Fig. 5). In this case, we cannot find dense regions even when we change the parameter values. For comparison with FPCS, we show a genetic map of E. coli, where B. subtilis is used as the background set, $\theta_T = \theta_B = 2$, and $\eta = 5$ (Fig. 6). Unlike in Fig. 5, we can clearly see dense regions, and most of these regions correspond to some biological features.

Fig. 7 shows genetic maps of C. perfringens and S. enterica. In the map of C. perfringens, we see dense regions at rRNA and tRNA, while in the map of S. enterica we cannot see such regions at the designated features, although dense regions do exist.

Table 2: List of DNA sequences used in experiments.
Name | Accession # | GC% | Length (bp) | Gram-pos/neg
E. coli | NC_[…] | […] | […],639,675 | -
B. subtilis | NC_[…] | […] | 4,214,630 | +
C. perfringens | NC_[…] | […] | […],256,683 | +
S. enterica | NC_[…] | […] | […],791,958 | -

Fig. 4: A genetic map of the whole DNA sequence of B. subtilis, where $\theta$ = […], $minsup_f$ = 1000, and $minsup_{xy}$ = 20.

Fig. 5: A genetic map of the whole DNA sequence of E. coli, where $\theta$ = […], $minsup_f$ = 3000, and $minsup_{xy}$ = 20.

Fig. 6: A genetic map of the whole DNA sequence of E. coli, where B. subtilis is used as the background set, $\theta_T = \theta_B = 2.0$ and $\eta = 5$.

4.3 F-measures

In this section, we quantitatively evaluate the proposed method by the F-measure. Table 3 shows the results, where the feature column shows the features we consider correct. For E. coli, we use RNA and transposon as target features and, for B. subtilis, we use only RNA.

Table 3: Precision, recall, and $F_{1/4}$ values for E. coli and B. subtilis, where the features in the feature column are assumed to be correct.
NC# | feature | θ | minsup_f | minsup_xy | precision | recall | F_{1/4}
NC_[…] | RNA, Transposon | […] | […] | […] | […] | […] | […]
NC_[…] | RNA, Transposon | […] | […] | […] | […] | […] | […]
NC_[…] | RNA, Transposon | […] | […] | […] | […] | […] | […]
NC_[…] | RNA, Transposon | […] | […] | […] | […] | […] | […]
NC_[…] | RNA | […] | […] | […] | […] | […] | […]
NC_[…] | RNA | […] | […] | […] | […] | […] | […]
NC_[…] | RNA | […] | […] | […] | […] | […] | […]

First of all, the F-measure values for B. subtilis are much better than those for E. coli, as we have seen from the genetic maps. Next, we compare these results with those of FPCS [10]. The F-measures obtained by FPCS are much larger than those of the proposed method. In the case of E. coli, $F_{1/4}$ = […] for RNA and transposon, where precision is […] and recall is […], and when B. subtilis is given, $F_{1/4}$ = […] for only RNA, where precision is […] and recall is […].

5. Conclusion

We have proposed the peculiar composition of frequent substrings, which requires only a single set of sequences, and evaluated it both quantitatively and qualitatively. Because the proposed method requires only a single set, it is simpler than FPCS. However, the F-measure values of the proposed method are much smaller than those of FPCS. The reason may lie in how x and y are defined to be frequent for final output patterns of the form xy. In FPCS, x and y are defined independently using two ratios, $\theta_T$ and $\theta_B$, and thus their frequencies can be quite different. In fact, we see quite different frequencies of x and y in Table 1. In the proposed method, on the other hand, being frequent is defined absolutely by one parameter value, $minsup_f$. Therefore, the frequencies of x and y must be similar. Thus, FPCS is more finely granular than the proposed approach.

From the maps in Figs. 4, 5, and 7, it seems that peculiar compositions of frequent substrings appear at RNA for Gram-positive bacteria (B. subtilis and C. perfringens) while they do not for Gram-negative bacteria (E. coli and S. enterica).
It is important to validate this hypothesis with more experiments.

References

[1] L. Parida, Pattern Discovery in Bioinformatics: Theory & Algorithms. Chapman & Hall/CRC, July 2007.

[2] A. Apostolico, M. E. Bock, S. Lonardi, and X. Xu, Efficient Detection of Unusual Words, J. of Comput. Biol., vol. 7, no. 1/2, pp. […], Jan. […].
[3] M.-Y. Leung, G. M. Marsh, and T. P. Speed, Over- and Underrepresentation of Short DNA Words in Herpesvirus Genomes, J. of Comput. Biol., vol. 3, no. 3, pp. […], […].
[4] S. Schbath, An Efficient Statistic to Detect Over- and Underrepresented Words in DNA Sequences, J. of Comput. Biol., vol. 4, no. 2, pp. […], […].
[5] T. Marschall and S. Rahmann, Efficient Exact Motif Discovery, Bioinformatics, vol. 25, no. 12, pp. i356-i364, […].
[6] D. Ikeda, Characteristic Sets of Strings Common to Semi-structured Documents, in Proceedings of the Second International Conference on Discovery Science, ser. Lecture Notes in Artificial Intelligence 1721, December 1999, pp. […].
[7] X. Ji, J. Bailey, and G. Dong, Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints, Knowledge and Information Systems, vol. 11, no. 3, pp. […], […].
[8] D. Ikeda and E. Suzuki, Mining Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts, in Proc. of ECML PKDD, September 2009, pp. […].
[9] D. Ikeda, O. Maruyama, and S. Kuhara, Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis, in Proceedings of the 4th International Conference on Bioinformatics Models, Methods and Algorithms, February 2013, pp. […].

Fig. 7: Genetic maps of the whole DNA sequence of C. perfringens (upper side) and S. enterica (lower side), where $\theta$ = […], $minsup_f$ = 3000, and $minsup_{xy}$ = 10 are used for C. perfringens and $\theta$ = 57, $minsup_f$ = 2000, and $minsup_{xy}$ = 10 for S. enterica.

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns.

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns. Curriculum, fourth lecture: Niels Richard Hansen November 30, 2011 NRH: Handout pages 1-8 (NRH: Sections 2.1-2.5) Keywords: binomial distribution, dice games, discrete probability distributions, geometric

More information

Prof. Christian MICHEL

Prof. Christian MICHEL CIRCULAR CODES IN GENES AND GENOMES - 2013 - Prof. Christian MICHEL Theoretical Bioinformatics ICube University of Strasbourg, CNRS France c.michel@unistra.fr http://dpt-info.u-strasbg.fr/~c.michel/ Prof.

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Stochastic processes and Markov chains (part II)

Stochastic processes and Markov chains (part II) Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Amsterdam, The

More information

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes

Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Discovering Most Classificatory Patterns for Very Expressive Pattern Classes Masayuki Takeda 1,2, Shunsuke Inenaga 1,2, Hideo Bannai 3, Ayumi Shinohara 1,2, and Setsuo Arikawa 1 1 Department of Informatics,

More information

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Searching Sear ( Sub- (Sub )Strings Ulf Leser Searching (Sub-)Strings Ulf Leser This Lecture Exact substring search Naïve Boyer-Moore Searching with profiles Sequence profiles Ungapped approximate search Statistical evaluation of search results Ulf

More information

Bacterial Genetics & Operons

Bacterial Genetics & Operons Bacterial Genetics & Operons The Bacterial Genome Because bacteria have simple genomes, they are used most often in molecular genetics studies Most of what we know about bacterial genetics comes from the

More information

Computational Biology: Basics & Interesting Problems

Computational Biology: Basics & Interesting Problems Computational Biology: Basics & Interesting Problems Summary Sources of information Biological concepts: structure & terminology Sequencing Gene finding Protein structure prediction Sources of information

More information

Pattern Structures 1

Pattern Structures 1 Pattern Structures 1 Pattern Structures Models describe whole or a large part of the data Pattern characterizes some local aspect of the data Pattern is a predicate that returns true for those objects

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

COMPUTATIONAL PROCESSES IN LIVING CELLS

COMPUTATIONAL PROCESSES IN LIVING CELLS COMPUTATIONAL PROCESSES IN LIVING CELLS Lecture 7: Formal Systems for Gene Assembly in Ciliates: the String Pointer Reduction System March 31, 2010 MDS-descriptors MDS descriptors: strings over the following

More information

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM?

Neyman-Pearson. More Motifs. Weight Matrix Models. What s best WMM? Neyman-Pearson More Motifs WMM, log odds scores, Neyman-Pearson, background; Greedy & EM for motif discovery Given a sample x 1, x 2,..., x n, from a distribution f(... #) with parameter #, want to test

More information

Discrete Probability Refresher

Discrete Probability Refresher ECE 1502 Information Theory Discrete Probability Refresher F. R. Kschischang Dept. of Electrical and Computer Engineering University of Toronto January 13, 1999 revised January 11, 2006 Probability theory

More information

Gibbs Sampling Methods for Multiple Sequence Alignment

Gibbs Sampling Methods for Multiple Sequence Alignment Gibbs Sampling Methods for Multiple Sequence Alignment Scott C. Schmidler 1 Jun S. Liu 2 1 Section on Medical Informatics and 2 Department of Statistics Stanford University 11/17/99 1 Outline Statistical

More information

Essentiality in B. subtilis

Essentiality in B. subtilis Essentiality in B. subtilis 100% 75% Essential genes Non-essential genes Lagging 50% 25% Leading 0% non-highly expressed highly expressed non-highly expressed highly expressed 1 http://www.pasteur.fr/recherche/unites/reg/

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Exhaustive search. CS 466 Saurabh Sinha

Exhaustive search. CS 466 Saurabh Sinha Exhaustive search CS 466 Saurabh Sinha Agenda Two different problems Restriction mapping Motif finding Common theme: exhaustive search of solution space Reading: Chapter 4. Restriction Mapping Restriction

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Hidden Markov Models

Hidden Markov Models Hidden Markov Models A selection of slides taken from the following: Chris Bystroff Protein Folding Initiation Site Motifs Iosif Vaisman Bioinformatics and Gene Discovery Colin Cherry Hidden Markov Models

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part I) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Pair Hidden Markov Models

Pair Hidden Markov Models Pair Hidden Markov Models Scribe: Rishi Bedi Lecturer: Serafim Batzoglou January 29, 2015 1 Recap of HMMs alphabet: Σ = {b 1,...b M } set of states: Q = {1,..., K} transition probabilities: A = [a ij ]

More information

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough Search Search is a key component of intelligent problem solving Search can be used to Find a desired goal if time allows Get closer to the goal if time is not enough section 11 page 1 The size of the search

More information

Evaluation of the Number of Different Genomes on Medium and Identification of Known Genomes Using Composition Spectra Approach.

Evaluation of the Number of Different Genomes on Medium and Identification of Known Genomes Using Composition Spectra Approach. Evaluation of the Number of Different Genomes on Medium and Identification of Known Genomes Using Composition Spectra Approach Valery Kirzhner *1 & Zeev Volkovich 2 1 Institute of Evolution, University

More information

1 Alphabets and Languages

1 Alphabets and Languages 1 Alphabets and Languages Look at handout 1 (inference rules for sets) and use the rules on some examples like {a} {{a}} {a} {a, b}, {a} {{a}}, {a} {{a}}, {a} {a, b}, a {{a}}, a {a, b}, a {{a}}, a {a,

More information

Lecture 3: Markov chains.

Lecture 3: Markov chains. 1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.

More information

Supplementary Information

Supplementary Information Supplementary Information 1 List of Figures 1 Models of circular chromosomes. 2 Distribution of distances between core genes in Escherichia coli K12, arc based model. 3 Distribution of distances between

More information

More Codon Usage Bias

More Codon Usage Bias .. CSC448 Bioinformatics Algorithms Alexander Dehtyar.. DA Sequence Evaluation Part II More Codon Usage Bias Scaled χ 2 χ 2 measure. In statistics, the χ 2 statstic computes how different the distribution

More information

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010 Hidden Lecture 4: Hidden : An Introduction to Dynamic Decision Making November 11, 2010 Special Meeting 1/26 Markov Model Hidden When a dynamical system is probabilistic it may be determined by the transition

More information
