Evolutionary dynamics of conserved non-coding DNA elements: Big bang or gradual accretion? Sujai Kumar


Evolutionary dynamics of conserved non-coding DNA elements: Big bang or gradual accretion?
Sujai Kumar
Master of Science
School of Informatics
University of Edinburgh
2007

Abstract

Background

Previous studies have found DNA elements that are highly conserved across species from the same lineage, even though they do not code for proteins or RNA. One proposed function of such conserved non-coding elements (CNEs) is that they are cis-regulatory sequences for developmental genes which act as an abstraction of genetic regulatory networks, thus allowing new animal body plans to be specified in a modular way. This thesis tests the specific proposal by a previous study that CNEs arose in a big bang in the Precambrian, approximately 600 million years ago.

Results

The evolutionary dynamics of CNEs were studied by first identifying the elements, and then examining their levels of identity over time. Pairwise comparative sequence analysis of five contemporary nematode species provided a window into the past because these species diverged at different points of time over the last approximately 700 million years. The number of CNEs and their basic properties for the three most recently diverged species match the results obtained by other researchers, although no clear trend is visible in the change in identity of CNEs with respect to time since divergence. On adding two more species to the analysis, it was found that no such elements could be identified for species pairs with deep divergences.

Conclusions

The absence of CNEs for pairwise comparisons of species that diverged earliest indicates that CNEs did not arise in a big bang. CNEs that were found for the three Caenorhabditis species that diverged relatively recently (approximately 100 million years ago) seem to be specific only to that clade. However, the big bang hypothesis cannot be conclusively discarded because it is possible that the elements exist, but are short, or have multiple components spread across the genome, and are therefore difficult to detect. Missing CNEs could therefore be a limitation of computational approaches to discovering CNEs, and this study also suggests some ways to overcome those limitations.

Acknowledgements

I am very grateful to Alasdair Anthony and Ann Hedley at the Institute of Evolutionary Biology for getting me started on the mechanics of this project. Other members of the lab group also patiently heard my semi-formed thoughts on conserved non-coding elements, asked penetrating questions, and offered useful advice from time to time. Many thanks are also due to Douglas Armstrong for providing access to excellent computing resources and for helping with all administrative aspects of the MSc course. Most importantly, I would like to thank Mark Blaxter whose enthusiasm for life and all living things is contagious. He was the inspiration behind this project and provided much advice, encouragement, and pizza over the course of the summer.

Contents

1 Introduction
    Conserved Non-coding Elements (CNEs)
    Conservation of DNA
    Non-coding regions of the genome
    Properties and proposed functionality of CNEs
    Hypothesis and approach
    Scope
    Structure
2 Methods and Materials
    Obtaining genome sequences
    Finding CNEs
    Finding conserved portions
    Removing coding regions
    Determining CNE similarities
3 Results
    CNE counts
    CNEs found using methodology and parameters from Vavouri et al
    CNEs found after additional steps to remove coding regions
    CNEs found for higher sensitivity levels
    CNEs shared across all pairs of species
    CNEs for C. elegans, C. briggsae, and C. remanei
    Aggregate properties of CNEs
4 Discussion
    Rejection of big bang hypothesis
    Limitations of current study, and future work
Appendix: Coding regions in GFF files
Bibliography

List of Figures

1.1 Phylogenetic tree of five nematode species compared in this study (Caenorhabditis divergences from Stein et al., 2003; B. malayi and T. spiralis divergences from Vanfleteren et al., 1994)

1.2 Identity vs time plots to verify evolutionary dynamics of CNEs. A minimum 25% identity is expected in all cases (dotted line) because sequences are made up of only four bases: A, T, G, and C. Because background nucleotide concentrations are biased (e.g. lower G-C levels), the minimum level of identity would be higher than 25% (dot-dashed line)

2.1 Example fragment of a nucleotide FASTA file

2.2 Steps for finding CNEs

2.3 (a) Fragment of a megablast results file (column headers: q = query identifier, t = target database identifier, %id = percentage identity, len = length of alignment, mis = number of mismatches in alignment, gap = number of gaps, q_st = starting coordinate of query sequence, q_en = query end, t_st = target start, t_en = target end, e = expect value, bit = bit score) and (b) the output of combine-mbl.pl for that fragment

2.4 Ten sample lines of GFF annotation file for C. briggsae Chromosome I. The highlighted entries depict coding regions. If the coordinates of a megablast result overlapped these coordinates, then it was discarded as a conserved coding region. Eventually, only putative conserved non-coding elements (CNEs) remained. See Appendix for the complete list

Fragment of coding region file with asterisks used to tag GFF source and feature combinations that specified a coding region

3.1 Characteristics of CNEs found for each pair (y axis), plotted against the time since divergence of the species in that pair (x axis): a) length, b) bit-score, and c) percentage identity. Because several pairs share the same time since divergence (such as C. elegans–C. briggsae, and C. elegans–C. remanei, both 100 MYA), this plot jitters the locations along the x-axis to make it easy to identify the data points for each pair

CNE percentage identity versus length, visualized as a scatterplot and as a 3D histogram, for C. briggsae–C. remanei, C. elegans–C. briggsae, and C. elegans–C. remanei

List of Tables

1.1 Level of conservation of CNEs in different groups of species

2.1 Sources for whole genome sequences

3.1 CNEs found for all ten pairs (in alphabetical order) of five nematode species using the method and parameters from Vavouri et al. (2007) for finding CNEs (with megablast parameters -W 30, -e 0.001)

CNEs found after additional checks to determine coding regions (-W 30, -e 0.001)

CNEs found using different sensitivity parameter settings

Results of clustering CNEs found at different sensitivity levels

Comparisons of mean length, bit-scores, and percentage identity for CNEs shared in two comparisons: C. briggsae–C. remanei (diverged 80 MYA), and C. elegans–C. briggsae (diverged 100 MYA)

Comparisons of mean length, bit-scores, and percentage identity for CNEs shared in two comparisons: C. briggsae–C. remanei (diverged 80 MYA), and C. elegans–C. remanei (diverged 100 MYA)

Chapter 1

Introduction

The increasing availability of full genome sequences has led to many comparative studies that have examined the non-protein-coding part of genomes. Over the last decade, several non-coding elements have been found that are completely conserved or conserved with a high degree of identity in species as diverse as Homo sapiens and Fugu rubripes (the Japanese pufferfish), which last shared a common ancestor approximately 450 million years ago (MYA). This level of conservation indicates that such sequences are functional even though they do not code for proteins or RNA. The real function of such elements remains an open question.

Understanding the evolution of non-coding DNA has the potential to address questions such as how genomes evolved and how they are still evolving. More importantly, it gives us a way to attempt to answer fundamental questions such as how the incredible complexity of life came to be.

This thesis analyses the evolutionary dynamics of conserved non-coding elements (CNEs). It builds on the foundation laid by Vavouri et al. (2007), who proposed that CNEs are regulatory elements for developmental genes and that it was the "rewiring" of CNEs that led to the evolution of the vast diversity of animal body plans. In their study, they compared the genomes of three species from the phylum Nematoda: Caenorhabditis elegans, Caenorhabditis briggsae, and Caenorhabditis remanei. Their analysis is replicated here, and two additional species from the same phylum, for which full genome sequences have recently become available, were added: Brugia malayi and Trichinella spiralis. These five species last shared a common ancestor more than 600 MYA, and the comparative analysis in this thesis helps answer whether CNEs arose only once (in a "big bang") in the Precambrian as proposed by Vavouri et al., or emerged gradually through evolutionary history.

Answering this question would provide us with insights into the process of evolution, may allow us to better understand how animal body plans are specified, and let us speculate whether another explosion in species diversity (of the kind seen 600 MYA) is possible in the future.

To test the big bang hypothesis, the main analytical method employed in this thesis was to compare the level of similarity between such elements across species that diverged at different times. In Section 1.1, the current understanding of CNEs is reviewed as background material for understanding the hypothesis. The hypothesis and the approach used to test it are presented in detail in Section 1.2. In the last parts of this introductory chapter, the scope and structure of the remaining chapters of this thesis are presented.

1.1 Conserved Non-coding Elements (CNEs)

Conserved non-coding elements (CNEs) are a recent discovery in several cross-species comparisons. Research groups have not yet decided on a common term for them and each has its own acronym for such sequences:

    CNE - Conserved Non-coding Element (Vavouri et al., 2007, 2006, Woolfe et al., 2005)
    UCR - Ultra-Conserved non-coding Region (Sandelin et al., 2004)
    MCS - Multi-species Conserved Sequence (Margulies et al., 2003)
    CNG - Conserved Non-Genic sequence (Dermitzakis et al., 2005)
    HCE - Highly Conserved Element (Siepel et al., 2005)

This thesis uses the term CNE because it builds on the research and claims made by Vavouri et al. (2007, 2006) and Woolfe et al. (2005), and because the term captures both key aspects of such sequences: that they are conserved across species, and that they are non-coding for proteins or RNA.

1.1.1 Conservation of DNA

Conserved DNA sequences are interesting because they indicate that the sequences are functional. Non-functional sections of the genome undergo mutation and drift apart as species diverge from each other. Functional sections of the genome remain recognisably similar over long time periods because they code for proteins, code for RNA, are structural, or act as regulatory sites for enhancers, promoters, repressors, and so on. If a section of DNA has such a functional role, it will be under purifying selection and is likely to remain the same or similar over millions of years of mutation pressure. This is the key idea behind all comparative analyses across species, and is a way of identifying functional parts of the genome.

1.1.2 Non-coding regions of the genome

Historically, protein-coding genes were the focus of genome studies (Bird et al., 2006) and a sequence was considered interesting only if it encoded a protein or an RNA. As better experimental and informatics technologies were developed, it was discovered that protein and RNA coding genes only account for small proportions of the whole genome (1.5% to 25% in animal genomes). Comparative genomics studies have identified non-coding regions that appear to be highly conserved and, although some parts are now understood to be a complex interacting network of regulatory elements, the functionality of other non-coding parts remains unknown.

1.1.3 Properties and proposed functionality of CNEs

CNEs have been identified for groups of vertebrates and invertebrates separately. Although no sequence identity has been discovered so far between CNEs in vertebrates and CNEs in invertebrates, they share characteristics such as:

- High levels of identity (higher than that of protein-coding genes in most cases), across a wide range of species: Table 1.1 summarizes the level of conservation of CNEs for different groups of species. Although the data are from different sources and do not use the same measures of identity, the figures provide a general idea of how much these sequences are conserved even in the case of species that diverged approximately 450 MYA.
- Clustering around genes: The density of CNEs is higher in gene-rich regions in humans (Bejerano et al., 2004, Sandelin et al., 2004, Woolfe et al., 2005) and nematodes (Vavouri et al., 2007), with several CNEs clustered around each gene.
- Association with developmental genes: Gene association is determined by looking for the transcription start site nearest to each CNE. CNE-associated genes seem to be enriched for regulators of development such as transcription factors and signalling genes (Sandelin et al., 2004, McEwen et al., 2006).

Table 1.1: Level of conservation of CNEs in different groups of species

Species compared | Number of CNEs and level of identity | Last common ancestor
Fruit flies (Glazov et al., 2005): Drosophila melanogaster, Drosophila pseudoobscura | […] elements; 100% identity; length > 50bp | […] MYA
Mammals (Bejerano et al., 2004): Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat) | 256 elements; 100% identity; length > 200bp | 55 MYA
Nematodes (Vavouri et al., 2007): Caenorhabditis elegans, Caenorhabditis briggsae, Caenorhabditis remanei | 2084 elements; megablast word seed size 30bp (W30) with e-value threshold 0.001; average length 69bp | 100 MYA
Vertebrates (Woolfe et al., 2005): Homo sapiens (Human), Takifugu rubripes (Pufferfish) | 1373 elements; 84% identity; average length ~200bp | 450 MYA

CNEs also exhibit other interesting properties that are not yet understood, such as a spike in AT frequency just inside CNE boundaries (in sharp contrast to flanking regions, Vavouri et al., 2007) and AT frequencies that are similar (~65%) across species despite the background AT content of each genome being different.

Based on the properties listed above (high identity levels, association with developmental genes, specificity to phyla) and on experiments testing the functionality of CNEs, the most likely function of CNEs is that they are cis-elements that regulate the transcription of a core set of developmental regulatory genes in each species. Cis-elements are regions of DNA that lie on the same DNA molecule as the gene they regulate. Genetic regulatory networks (GRNs) use cis-elements extensively to regulate the complex production of proteins with the help of biological controls such as signal transducers, switches, feedback loops, feedforward loops, and combinatorial functions such as "and" and "or" relationships (Andrianantoandro et al., 2006). Development of the animal body plan is controlled by large GRNs, and changes in core developmental GRNs can result in new animal body plans (Davidson and Erwin, 2006).

Vavouri et al. (2007) proposed that the initial emergence and subsequent modification of CNEs associated with GRNs was responsible for the evolution of new animal body plans. According to this theory, each animal group has a different set of CNEs because the core GRN for that animal group evolved as a result of the rewiring of CNEs. The vast diversity of body plans first seen in the fossil record of the Cambrian indicates that an evolutionary explosion started in the Precambrian, and it is possible that the non-coding elements conserved in modern species are hard-wired traces of the changes that took place in core developmental regulatory networks. Because CNEs were highly conserved across species in the same family (e.g., mammals, Dermitzakis et al., 2005) and showed no conservation at all across species from different families (e.g., between humans and nematodes), it is plausible that CNEs are linked in this way to the specification of body plans.

The claim that CNEs arose in a big bang in a short period of time around the Precambrian, and are responsible for the profusion of animal body plans at that time, is interesting and testable. If the claim is true, it should be possible to identify the same CNEs in all branches of a phylum. The alternative is that CNEs arise from time to time, and that new CNEs can be seen in every branch of the phylogenetic tree.

Figure 1.1: Phylogenetic tree of five nematode species compared in this study (Caenorhabditis divergences from Stein et al., 2003; B. malayi and T. spiralis divergences from Vanfleteren et al., 1994)

The next subsection frames this hypothesis as a specific question, and outlines the approach taken in this study to determine the evolutionary dynamics of CNEs.

1.2 Hypothesis and approach

The main goal of this study is to find evidence for or against the idea that CNEs arose in a big bang once, several hundred million years ago. The overall strategy for doing this is to compare genomes from the same phylum that diverged at different points of time over the last half a billion years, find CNEs in these genomes, and see how their levels of identity have changed over time.

To study the question of how the CNE identities have changed, species from a single phylum are needed because CNEs are not conserved across different phyla. The phylum Nematoda is ideally suited for this purpose because five species within this phylum have recently been sequenced completely, and the approximate dates of divergence for these species span a long time period (from 80 to 700 MYA): Caenorhabditis elegans, Caenorhabditis briggsae, Caenorhabditis remanei, Brugia malayi, and Trichinella spiralis. The dates in Figure 1.1 are estimates that have an error of ±20 million years for the three Caenorhabditis species, and the error increases to ±100 million years for the branch points of B. malayi and T. spiralis.

A CNE is found by comparing the genomes of two species, identifying the conserved elements, and removing those elements that are known to code for proteins or RNA. Thus, CNEs are defined for pairs of species for the purposes of this study, and their level of identity for each pair can be determined. Although CNEs are generally conserved with high levels of identity compared to protein-coding sequences, a pair of species that diverged relatively recently (e.g., C. briggsae and C. remanei, 80 MYA) would be expected to share CNEs with an even higher level of identity than a pair that diverged much earlier (e.g., C. elegans and B. malayi, 500 MYA).
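Any comparison of identity levels has a floor set by chance alone, which Figure 1.2 marks with its dotted and dot-dashed lines. As a rough illustration, if positions are assumed to be independent and both sequences are assumed to share the same base composition, the expected identity between two unrelated sequences is the sum of the squared base frequencies. The function name and the 65% AT composition below are illustrative assumptions, not values measured on the genomes used in this study.

```python
def chance_identity(base_freqs):
    """Expected fraction of identical positions between two unrelated
    sequences drawn independently from the same base composition."""
    total = sum(base_freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    # Two random positions match on base b with probability p_b * p_b.
    return sum(p * p for p in base_freqs.values())


# Unbiased composition: each of the four bases is equally likely.
uniform = {"A": 0.25, "T": 0.25, "G": 0.25, "C": 0.25}

# Hypothetical AT-rich composition (65% AT, split evenly); an assumption
# chosen only to show the effect of composition bias.
at_rich = {"A": 0.325, "T": 0.325, "G": 0.175, "C": 0.175}

if __name__ == "__main__":
    print(f"uniform composition: {chance_identity(uniform):.4f}")
    print(f"AT-rich composition: {chance_identity(at_rich):.4f}")
```

Under these assumptions the unbiased case gives the 25% floor, and any bias in composition raises it, which is the reason the dot-dashed line in Figure 1.2 sits above the dotted one.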

[Figure 1.2, two panels, each plotting identity (y-axis) against time since divergence in MYA (x-axis): (a) Supporting big bang model; (b) Supporting gradual accretion model]

Figure 1.2: Identity vs time plots to verify evolutionary dynamics of CNEs. A minimum 25% identity is expected in all cases (dotted line) because sequences are made up of only four bases: A, T, G, and C. Because background nucleotide concentrations are biased (e.g. lower G-C levels), the minimum level of identity would be higher than 25% (dot-dashed line).

If CNEs arose in the phylum Nematoda in a big bang, then at least some of the CNEs should be present in all ten pairwise comparisons between the five species, and one would expect the level of similarity for each pair to decrease with the time since divergence of that pair. For each CNE, if a plot was created where the x-axis represented time since divergence and the y-axis represented level of similarity (percentage identity), then the plot might look something like Figure 1.2a had the CNE arisen in a big bang only once.

On the other hand, if some CNEs are gradually recruited to the genome, then one would expect to see an identity vs time plot as in Figure 1.2b. That is, no corresponding CNEs would exist for pairs that diverged more than a certain number of years ago.

The aim of this thesis is to find CNEs for each of the pairwise comparisons of the five nematode species, and to determine how the identity or similarity of CNEs depends on the time since the pair diverged. Examining the pairwise identities for each CNE would provide evidence for or against the big bang theory of CNE emergence.

It is also possible that some CNEs arose in a big bang during the Precambrian, whereas others arose much later on the evolutionary timeline. Davidson and Erwin (2006) point out that animal GRNs have levels of hierarchy. Those CNEs associated with the core or kernel GRNs might show big bang features (Fig. 1.2a) because they were responsible for specifying the nematode body plan, whereas those associated with peripheral gene networks might have arisen relatively recently within the individual branches of the phylogenetic tree (Fig. 1.2b).

1.3 Scope

The analysis reported in this thesis draws extensively on existing bioinformatics tools and databases, and several new programs were written to manage the process of finding CNEs and analysing their levels of identity for ten pairs of species. These programs were designed to be very efficient because genome annotation files for well annotated species can be several gigabytes in size and need to be searched rapidly to decide if a particular sequence is coding or non-coding.

Previous research on nematode CNEs had concentrated only on the three Caenorhabditis species: elegans, briggsae, and remanei. The complete genome sequences of Brugia malayi and Trichinella spiralis have only recently become available and this is the first study to look at conserved non-coding elements in all five species at the same time.

Although past studies have explored other properties of CNEs (such as their AT frequency, gene association, etc.), this study concentrates only on the level of identity or similarity of the CNEs found in each pair of species, because the goal is to look for evidence for or against the big bang hypothesis of CNE emergence.

1.4 Structure

The rest of this thesis is organized as follows. Chapter 2 describes the Methods and Materials used to carry out the study. This part includes details on all the steps used to find CNEs, ranging from descriptions of the data sources to the optimizations carried out to speed up the process of discovering coding regions that had to be eliminated to discover CNEs.

Chapter 3 presents the results, beginning with a summary of the numbers of CNEs found for each pair at different sensitivity levels. The CNEs found are then clustered to determine which CNEs are shared across multiple pairwise comparisons. An analysis follows, describing the trends in identity for several thousand CNEs found in the three Caenorhabditis comparisons. The last part of Chapter 3 describes some aggregate properties of the CNEs found.

In the Discussion (Chapter 4), the results are summarized in the context of the original hypothesis. Although the evidence points to a rejection of the hypothesis, the hypothesis cannot be discarded with certainty for several reasons that are presented in detail. The limitations of this study and suggestions for future work complete the thesis.

Chapter 2

Methods and Materials

The following three steps broadly describe the method for testing the hypothesis that CNEs arose in a big bang:

1. Obtain complete genome sequences for the five species being considered
2. Find CNEs
   (a) Find the conserved portions for each possible pair of species
   (b) Out of the conserved portions, remove those that overlap known coding regions to identify CNEs
3. Determine CNE similarities for pairs of species that diverged at different points of time

Each step presented its own set of challenges, and several choices had to be made at each one. The challenges, decisions, and the reasons for those decisions are described in the next three subsections. Several Perl scripts were written for processing the data at each step. These programs are described in this chapter and the source code for the programs is available at […].

2.1 Obtaining genome sequences

Genome sequences for each of the five species were obtained in FASTA format (Pearson and Lipman, 1988) from the sources shown in Table 2.1. FASTA files are a standard way of storing and sharing sequence information in text files, and typically consist of a header line that contains the sequence identifier, followed by a series of letters that denote nucleotides (in the case of DNA or RNA sequences) or amino acids (in the case of protein sequences). Figure 2.1 shows the first 200 bases of the X chromosome of C. elegans in FASTA format.

Table 2.1: Sources for whole genome sequences.

Species | Source | Notes
C. elegans | The C. elegans Sequencing Consortium (1998), sequence obtained from WormBase (Bieri et al., 2007) | Release WS170
C. briggsae | Stein et al. (2003), sequence obtained from WormBase (Bieri et al., 2007) | Release CB3
C. remanei | GSC (2007a), sequence obtained from WormBase (Bieri et al., 2007) | Assembly […]
B. malayi | Ghedin et al. (2007), sequence obtained from Blaxter (2007) |
T. spiralis | GSC (2007b) | Release 1.0

Figure 2.1: Example fragment of a nucleotide FASTA file

C. elegans and C. briggsae are the most extensively studied of these five species and thus had the most complete genomes (The C. elegans Sequencing Consortium, 1998, Stein et al., 2003), in which all the bases were assigned to chromosomes. The other genomes were only available organized as contigs. A contig is the contiguous consensus sequence derived from overlapping DNA fragments that have been sequenced. The precise chromosomal location of a contig is not known, but for the purpose of this study the base sequences were enough to discover conserved regions.

The latest releases available at the time of this study were used for the results reported here. However, while writing the programs to discover CNEs, older releases of C. elegans (WS140) and C. briggsae (CB25) were also downloaded from the FTP archives at WormBase (Bieri et al., 2007) to verify that the programs discovered exactly the same CNEs as Vavouri et al. (2007). For C. remanei, a newer assembly was available halfway through the project, but an earlier version was used because genome annotations were available for that version. B. malayi and T. spiralis had been sequenced most recently and thus only one version was available for those species.

For the three Caenorhabditis species, annotation files that specified the function of known parts of the genome were available for each release, and were also downloaded from WormBase. The format of these files is described in more detail in Section 2.2.2: Removing coding regions. Although no annotation files for B. malayi or T. spiralis were publicly available at the time of this study, a FASTA file with all the known coding sequences for B. malayi was obtained from Blaxter (2007).

2.2 Finding CNEs

Vavouri et al.'s (2007) methodology was used as the starting point for identifying CNEs. Their procedure for a pair of species was repeated in this study for ten pairs (all possible pairs for five species, as in Table 3.1). The steps and parameters were initially kept identical to verify that the programs for this project were finding CNEs the same way. Subsequently, new parameter sets were tried, which are described in Chapter 3.

The overall process to find CNEs between two species was to first find the conserved portions (the parts that are recognizably similar) and then to remove those parts that overlap a known coding region. In each of the pairs in the first column of Table 3.1, the first species in the pair had better annotations, and was used as the reference against which coding regions were found and removed.

All the programs ending in .pl in Figure 2.2 were developed during the course of this thesis for finding CNEs. These programs are described in the next few subsections (highlighted in italics) along with other publicly available programs that were used at each stage of the process to find CNEs for each pair of species. Additionally, an overall script pipeline.pl was written that called these programs with the appropriate parameters for each pair of species.

2.2.1 Finding conserved portions

To identify conserved portions between two species, the program megablast (Zhang et al., 2000) was used. Megablast takes a query sequence and compares it against a target database to find all the subsequences of the query that have a hit (match) against the database.
The program uses a heuristic that is much faster than the dynamic programming algorithm for finding sequence alignments (Smith and Waterman, 1981). Although megablast is theoretically not guaranteed to find all the alignments between two sequences, in practice it almost always finds the best alignments.

For this thesis, all the parameters for megablast were kept at their default values, except the word seed size (-W, which specifies the number of contiguous nucleotides that must be identical in an alignment) and the e-value threshold (-e, which is a statistical estimate of how often that alignment is likely to occur by chance in a target database of that size). The initial megablast parameters (-W 30 and -e 0.001) were the same as Vavouri et al. (2007), though lower word seed sizes and less stringent e-value thresholds were also tried, as reported in the Results chapter. This set of parameters was not very sensitive: alignments shorter than 30 nucleotides were missed by definition, and some longer alignments were also missed because mismatches or gaps left them without any run of 30 identical nucleotides. In comparison, Woolfe et al. (2005) used -W 20 when they found CNEs between humans and pufferfish.

Figure 2.2: Steps for finding CNEs

Figure 2.3: (a) Fragment of a megablast results file (column headers: q = query identifier, t = target database identifier, %id = percentage identity, len = length of alignment, mis = number of mismatches in alignment, gap = number of gaps, q_st = starting coordinate of query sequence, q_en = query end, t_st = target start, t_en = target end, e = expect value, bit = bit score) and (b) the output of combine-mbl.pl for that fragment.

The tabular format was chosen for hits returned by megablast, and the better annotated genome was used as the query sequence in all the pairwise comparisons. Once megablast had run, overlapping hits on the query sequence were combined using combine-mbl.pl because it was assumed that two adjacent or overlapping hits represent the same CNE. Figure 2.3 provides an example of how combine-mbl.pl works. The query identifier remained the same in the combined megablast result, but the target database identifier was left blank because the hits could have been with different parts of the target database. Target start and end coordinates were also left out for the same reason. The percentage identity and the bit-scores of the combined result were determined by taking the lowest values out of the results that were combined (Dubchak et al., 2000).
Finding conserved regions in this way was not symmetric because overlapping regions were combined only for the better annotated genome. However, this was a reasonable simplification because the better annotated genome was the one on the basis of which overlaps with coding regions were determined, as described in the next subsection.

The output of combine-mbl.pl was the starting collection of conserved elements (Fig. 2.2). The steps for removing coding regions from this collection are described next.
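The combining step described above can be sketched as follows. This is a minimal Python illustration, not the original combine-mbl.pl (which was written in Perl and is not reproduced in this text): it assumes tab-separated input with the twelve columns listed in the caption of Figure 2.3, and the names Hit, parse_tabular, combine_hits and the example rows are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Hit:
    query: str        # query identifier (q)
    pct_id: float     # percentage identity (%id)
    q_start: int      # start coordinate on the query (q_st)
    q_end: int        # end coordinate on the query (q_en)
    bit_score: float  # bit score (bit)


def parse_tabular(lines: Iterable[str]) -> List[Hit]:
    """Read tab-separated hits laid out as in the Figure 2.3 caption:
    q, t, %id, len, mis, gap, q_st, q_en, t_st, t_en, e, bit."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        f = line.split("\t")
        start, end = sorted((int(f[6]), int(f[7])))
        hits.append(Hit(f[0], float(f[2]), start, end, float(f[11])))
    return hits


def combine_hits(hits: Iterable[Hit]) -> List[Hit]:
    """Merge adjacent or overlapping hits on the same query sequence.
    The merged record keeps the lowest identity and lowest bit score;
    target fields are dropped because the merged hits may come from
    different parts of the target database."""
    merged: List[Hit] = []
    for hit in sorted(hits, key=lambda h: (h.query, h.q_start, h.q_end)):
        last = merged[-1] if merged else None
        if last and last.query == hit.query and hit.q_start <= last.q_end + 1:
            last.q_end = max(last.q_end, hit.q_end)
            last.pct_id = min(last.pct_id, hit.pct_id)
            last.bit_score = min(last.bit_score, hit.bit_score)
        else:
            # Copy so that the caller's Hit objects are never modified.
            merged.append(Hit(hit.query, hit.pct_id, hit.q_start,
                              hit.q_end, hit.bit_score))
    return merged


# Made-up rows for illustration only; they are not data from the study.
EXAMPLE = [
    "chrI\tchrI\t100.00\t40\t0\t0\t1001\t1040\t5001\t5040\t1e-15\t79.8",
    "chrI\tchrIV\t95.00\t40\t2\t0\t1030\t1069\t200\t239\t1e-10\t63.9",
    "chrII\tchrII\t100.00\t35\t0\t0\t500\t534\t900\t934\t1e-12\t69.9",
]

if __name__ == "__main__":
    for element in combine_hits(parse_tabular(EXAMPLE)):
        print(element)
```

Sorting by query coordinate first means each hit only has to be compared with the most recently merged element, so the whole pass is a sort followed by a single scan.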

Figure 2.4: Ten sample lines of the GFF annotation file for C. briggsae Chromosome I. The highlighted entries depict coding regions. If the coordinates of a megablast result overlapped these coordinates, the result was discarded as a conserved coding region. Eventually, only putative conserved non-coding elements (CNEs) remained. See the Appendix for the complete list.

Removing coding regions

Continuing with the process developed by Vavouri et al., only those conserved portions for each species pair were retained that did not overlap any known coding regions. Identification of the coding regions was a multi-step process that included checking genome annotations, looking for transfer RNA (tRNA) coding regions, and matching against known expressed sequence tags (ESTs) for that species. Additionally, low-complexity repeats in the genome and known elegans repeats were marked and removed with the help of the RepeatMasker software package (Smit et al.).

Filtering megablast results against genome annotations

The most important step in deciding whether a conserved segment overlapped a coding region was to check it against the genome annotation. Annotations in the General Feature Format (GFF) were downloaded from WormBase (Bieri et al., 2007) for C. elegans, C. briggsae, and C. remanei. These three annotation GFFs were sufficient for checking 9 of the 10 pairwise comparisons, but no GFF was available for B. malayi, so the B. malayi–T. spiralis pair was processed differently, as described later in this section.

The GFF file fragment in Figure 2.4 lists several fields, but the important ones for this project were the first five: seqname, source, feature, start, and end. Each line represents a feature at a particular location on the genome. seqname was used to identify the chromosome or contig for which the annotation was provided, and the next two fields (source and feature) were used to determine whether that annotation was for a coding region.
The start and end fields mark the coordinates of that feature on that chromosome or contig. The Appendix lists the source and feature combinations found in the three GFF files for C. elegans, C. briggsae, and C. remanei. This list was examined manually, and features that referred to coding regions were tagged with an asterisk (Blaxter, 2007; Vavouri et al., 2007) and stored in a tab-separated file (Fig. 2.5).

Figure 2.5: Fragment of coding region file with asterisks used to tag GFF source and feature combinations that specified a coding region.

This tab-separated file for identifying exons was then used by the program clean-sort-gff.pl to pull out all the lines in the GFF file that referred to coding regions. clean-sort-gff.pl combined the coordinates of the coding regions (if they overlapped) and wrote out only the start and end coordinates of each combined region, to a file that was created for each chromosome or contig referred to in the GFF file.

Preprocessing the GFF file in this way into a sorted, non-overlapping set of start and end coordinate pairs for each known coding region was a major optimization. The simplified coding region coordinate file became short enough to be loaded into memory, where it could be binary searched to see whether a megablast result overlapped a coding region. Whereas naive code for checking each megablast result against the entire GFF annotation file took almost 15 hours on a high-end workstation, this optimization sped up the process by a factor of almost 20,000 (e.g., 36,000 megablast results for the C. elegans–C. briggsae pairwise comparison could be checked against the C. elegans GFF, with 15 million records, in less than 3 seconds).

The B. malayi–T. spiralis pair of species was tackled differently, as no GFF annotation was publicly available for the B. malayi genome.
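For the species that did have GFF annotations, the preprocessing and lookup described above can be sketched as follows. This is an illustrative Python sketch rather than the thesis's clean-sort-gff.pl: it assumes inclusive integer coordinates for one chromosome or contig, and the function names merge_intervals and overlaps_coding are hypothetical.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Collapse overlapping (start, end) pairs into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]


def overlaps_coding(merged, starts, start, end):
    """Return True if [start, end] overlaps any region in `merged`.

    `merged` must be sorted and disjoint, and `starts` must hold the
    start coordinate of each region in the same order.
    """
    # Index of the last coding region that begins at or before `end`.
    i = bisect_right(starts, end) - 1
    # Because regions are disjoint and sorted, only that region can overlap.
    return i >= 0 and merged[i][1] >= start


coding = merge_intervals([(150, 300), (100, 200), (1000, 1100)])
coding_starts = [region_start for region_start, _ in coding]
```

Each lookup costs a single binary search over the merged regions instead of a scan of every annotation record, which is the kind of saving the paragraph above attributes to the optimization.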
The list of conserved elements after the megablast step was converted to a FASTA file (using mbl2fasta.pl, described in more detail in the next section), and this FASTA file was blasted against a database of known B. malayi coding sequences, also in FASTA format (i.e., the program blastn (Altschul et al., 1997) was used to find matches between the two sets of sequences). The conserved regions in the B. malayi–T. spiralis pair that matched the B. malayi coding sequences with an e-value below the cutoff were removed, leaving a set of putative CNEs for this pair.

Converting filtered megablast results to FASTA

In the previous step, megablast results for each pair of species were checked against GFF annotations, and those that overlapped coding regions were removed. The putative CNEs were still in megablast output format (specified as a chromosome or contig location, along with the start and stop coordinates). The next set of steps required the putative CNEs to be in FASTA format so that the actual nucleotides could be checked to remove additional coding regions that were missed by the GFF checking step. The program mbl2fasta.pl converted each putative CNE in megablast output format to a FASTA sequence by looking up the appropriate genome sequence file, finding the right chromosome or contig, and pulling out the nucleotides from the start to the stop coordinate. CNEs for the B. malayi–T. spiralis pair had already been converted into FASTA format in the previous step, so mbl2fasta.pl was not run for that pair.

Using RepeatMasker to scan for simple or known elegans repeats

RepeatMasker (Smit et al.) is a program that finds interspersed repeats and low-complexity DNA sequences (such as ATATAT...) and masks these repeats by replacing the repeated nucleotides with a series of Ns. Simple repeats like these are very frequent in the genomes of all species and can lead to uninformative alignments or matches when looking for conserved sequences. RepeatMasker was run on the putative CNEs from the previous step at the slowest, most sensitive setting, using cross_match (Ewing and Green, 1998) as the comparison engine and the RepBase library (Jurka et al., 2005) of known repeats for C. elegans. The program remove-rm.pl was then used to remove all CNEs that contained more than 80% repeats.

RepeatMasker could have been run on each species before megablast was used to find conserved regions, but masking repeats in large genomes takes a long time, so it was more efficient to run megablast first (a very fast algorithm) and then run RepeatMasker only on the conserved regions found.
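The 80% repeat filter can be sketched in a few lines. This is an illustrative Python sketch and not the thesis's remove-rm.pl: it assumes the masked sequences are already in memory as a mapping from sequence name to sequence, that masked bases appear as N (upper or lower case), and the names masked_fraction and remove_repeat_rich are hypothetical.

```python
def masked_fraction(sequence):
    """Fraction of bases that were masked, i.e. replaced with N."""
    seq = sequence.upper()
    return seq.count("N") / len(seq) if seq else 0.0


def remove_repeat_rich(records, max_fraction=0.8):
    """Keep only sequences whose masked fraction does not exceed the limit.

    A sequence is dropped when MORE than `max_fraction` of it is masked,
    so a sequence that is exactly 80% masked is kept.
    """
    return {name: seq for name, seq in records.items()
            if masked_fraction(seq) <= max_fraction}


masked_cnes = {
    "cne_a": "ACGTNNNNNN",  # 60% masked, kept
    "cne_b": "NNNNNNNNNA",  # 90% masked, removed
    "cne_c": "ACNNNNNNNN",  # exactly 80% masked, kept
}
kept = remove_repeat_rich(masked_cnes)
```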
Using tRNAscan to scan for tRNA coding regions

Once simple repeats had been removed from the putative CNEs, tRNAscan (Lowe and Eddy, 1997) was run to find regions that matched known tRNA coding genes. Some of these regions had already been removed at the GFF checking stage because some of the better annotations included information on tRNA coding regions. As in the previous step, regions identified by tRNAscan as tRNA coding genes were removed using the program remove-trna.pl.

Removing known Rfam and miRNA regions

Continuing the process of filtering putative CNEs to remove all known coding regions, the next step was to check the Rfam and micro-RNA (miRNA) databases.

The Rfam database (Griffiths-Jones et al., 2005) has information about families of noncoding RNA and other structural RNA elements, and the miRNA database (Griffiths-Jones, 2004) contains predicted hairpin portions of miRNA transcripts. Putative CNEs were blasted against both these databases (using blastn with an e-value cutoff set through -e), and any CNEs that showed hits in these two databases were removed using remove-blastcoding.pl.

Removing ESTs for poorly annotated species

The final step in Vavouri et al.'s method for discovering CNEs was to look for matches between the putative CNEs for a pair of species and the Expressed Sequence Tag (EST) databases for those species. An EST is a low-cost, low-quality sequence of nucleotides obtained by sequencing cloned mRNAs. Because ESTs are obtained from mRNAs, a match to an EST is a positive indicator of a coding region, even though it may not be a protein coding gene. EST sequences were downloaded from EBI (Harte et al., 2004) for all the poorly annotated species in this study (i.e., all except C. elegans). As in the previous step, putative CNEs were blasted against the EST sequences for these species, and all sequences that had a match were removed using remove-blastcoding.pl.

Finding and removing matches against other databases and genomes

The steps described so far for removing coding regions from a set of conserved elements between two species were proposed by Vavouri et al. (2007). Additionally, Blaxter (2007) suggested blasting (i.e., using programs from NCBI's blast suite) the remaining CNEs against three other sequence sources to be sure that the CNEs obtained were non-coding:

blastx against NemPep3: NemPep3 (Wasmuth and Blaxter, 2006) is an exhaustive database of protein sequences from the phylum Nematoda. blastx was used to compare nucleotide sequences against the NemPep3 protein sequence database.

blastx against UniRef90: UniRef90 (Harte et al., 2004) is a non-redundant reference database of all proteins in the UniProt database. Protein sequences with at least 90% identity are clustered together in UniRef90.

tblastx against the C. elegans genome (for pairs without C. elegans): tblastx compares a nucleotide query sequence against a nucleotide target database, after translating all six reading frames of both sets of sequences. Because C. elegans had the best annotated genome, CNEs from pairs that were not checked against the C. elegans GFF were checked against the C. elegans genome to see if any of the CNEs matched known elegans coding regions.

These three searches addressed the same issue: looking for and removing known protein coding regions (especially in nematodes) from the set of putative CNEs. In all three cases, the CNEs with blast hits below the e-value cutoff were removed using remove-blast-coding.pl.

2.3 Determining CNE similarities

The megablast program used for finding conserved regions returns two measures of similarity for each sequence alignment that it finds:

Percentage identity is the most basic measure of similarity between two sequences and is defined as the percentage of total bases in the alignment that are identical between the two sequences.

Bit-score is based on the raw score obtained by summing the positive scores for each nucleotide match and the negative scores for each nucleotide mismatch or gap. The raw score is normalized to give the bit-score.

Percentage identity and bit-scores are impossible to derive from the megablast results alone when more than two megablast results are combined (overlapping hits were combined to give a consolidated region according to the procedure by Vavouri et al., 2007). Therefore, based on Dubchak et al. (2000), the lowest percentage identity or lowest bit-score of a set of overlapping hits was taken to represent the overall percentage identity or bit-score of the combined region. These two measures were used (along with the length of the CNE) to indicate the level of conservation of each pairwise CNE.

However, the goal of this thesis was to see whether the level of conservation of a CNE changed for different pairs of species. Therefore, a way was needed to establish whether two CNEs from different pairwise comparisons represented the same canonical CNE. Blastclust (NCBI, 2007) was used to cluster the CNEs on the basis of their sequence similarities.
CNEs belonging to the same cluster had very similar sequences, and, if they came from different pairs of species, then the levels of identity of each CNE could be compared based on the time since divergence of the species in that pair.

Mathematica 6.0 (Wolfram Research Inc, 2007) was used for all the statistical analyses and for creating plots of the levels of identity. Mathematica's list and set processing capabilities were especially useful in selecting clusters that contained CNEs from the desired pairs of species. The results of finding the CNEs, and of comparing their similarities across different species pairs based on how long ago each pair diverged, are reported in the next chapter.
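The cluster selection step lends itself to a short illustration. The thesis performed it in Mathematica; the sketch below uses Python sets instead and is not the author's code. It assumes a cluster file with one cluster per line and whitespace-separated sequence identifiers, and it assumes a hypothetical naming convention in which each CNE identifier starts with its species pair followed by a vertical bar (for example elegans-briggsae|cne0001). The function names are likewise hypothetical.

```python
def parse_clusters(lines):
    """Read clusters from text lines: one cluster per line, IDs split on whitespace."""
    return [line.split() for line in lines if line.strip()]


def pair_of(cne_id):
    """Extract the species pair from an ID of the assumed form 'pair|name'."""
    return cne_id.split("|", 1)[0]


def clusters_spanning(clusters, required_pairs):
    """Select clusters that contain at least one CNE from every required pair."""
    required = set(required_pairs)
    return [cluster for cluster in clusters
            if required <= {pair_of(cne_id) for cne_id in cluster}]


cluster_lines = [
    "elegans-briggsae|cne0001 elegans-remanei|cne0412 briggsae-remanei|cne0077",
    "elegans-briggsae|cne0002 elegans-briggsae|cne0003",
    "elegans-remanei|cne0500 briggsae-remanei|cne0090",
    "",
]
clusters = parse_clusters(cluster_lines)
shared = clusters_spanning(
    clusters, ["elegans-briggsae", "elegans-remanei", "briggsae-remanei"])
```

Clusters selected this way are the ones in which the identity of what is presumably the same element can be compared across pairs with different divergence times.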

Chapter 3 Results

This chapter presents the results of using the pipeline described in Chapter 2 and systematically varying the parameters for finding CNEs. The first part presents the counts of CNEs that were discovered for each pair of species at different levels of sensitivity, and the second part presents CNEs that were shared across multiple pairwise comparisons. One of the main goals of this study was to identify the shared CNEs and see how their level of identity changes depending on how long ago the species they came from had diverged. Although a few CNEs were found for pairs of species that diverged furthest in the past, none of those CNEs are seen in other pairwise comparisons, and so it is impossible to study how those CNEs have changed in identity over time.

The third part of the results concentrates on the properties of CNEs for the three species for which the most CNEs were found: C. elegans, C. briggsae, and C. remanei. Unfortunately, these three species diverged from each other relatively recently (100 MYA compared to 500 MYA for B. malayi and 700 MYA for T. spiralis), so they cannot be used to trace a long evolutionary history of CNEs. The fourth and final part describes some aggregate properties of the CNEs found in each pairwise comparison.

The results are clear that very few or no CNEs were found for pairs of species that diverged earliest, and that none of the thousands of CNEs found for the three species that diverged most recently were shared beyond that group. This is the main finding of this study and is presented in the Discussion (Chapter 4) with more context, along with the implications of such a finding.

Table 3.1: CNEs found for all ten pairs (in alphabetical order) of five nematode species using the method and parameters from Vavouri et al. (2007) for finding CNEs (with megablast parameters -W 30, -e 0.001).
Columns: Pair; megablast hits; Combine megablast hits; Filter against GFF of better annotated species; Remove repeatmasked; Remove tRNA; Remove Rfam, miRNA, and EST matches.
Rows: B. malayi–T. spiralis*; C. briggsae–B. malayi; C. briggsae–C. remanei; C. briggsae–T. spiralis; C. elegans–B. malayi; C. elegans–C. briggsae; C. elegans–C. remanei; C. elegans–T. spiralis; C. remanei–B. malayi; C. remanei–T. spiralis; All pairs.
*For B. malayi, no annotation GFF was available, so the megablast hits were blasted against a database of known B. malayi coding sequences, and matches were removed.

3.1 CNE counts

CNEs found using methodology and parameters from Vavouri et al.

Table 3.1 summarizes the numbers of CNEs found at each stage of the pipeline. The last column lists the number of CNEs found after all the steps proposed by Vavouri et al. (2007). The first step for finding CNEs, as described in Chapter 2, is to use megablast to find the conserved regions, and this table displays the CNEs found with word seed size 30 (-W 30) and e-value threshold 0.001 (-e 0.001), as described in Vavouri et al.

The sixth row in Table 3.1 (C. elegans–C. briggsae) acts as a confirmation that the programs written for this project did what they were supposed to do. The number of CNEs found for the pairwise comparison between C. elegans and C. briggsae corresponds well with Vavouri et al.'s finding of 3061 putative CNEs using the same method between the same pair of species. To further verify that the pipeline worked as intended, the set of elegans and briggsae CNEs found was blasted against the set of CNEs published as supplemental online material by Vavouri et al. (i.e., the program blastn was run with stringent settings and all the CNEs matched consistently).
The number of putative CNEs for that pair reported here is lower than their finding because more annotation information has become available for C. elegans since their study, more ESTs for C. briggsae were available to check against (Harte et al., 2004), and the genome sequences for C. elegans and C. briggsae have been updated.

It is immediately apparent from the last column in Table 3.1 that no CNEs were found at this sensitivity level (-W 30, -e 0.001) for two pairs (C. elegans–B. malayi and C. elegans–T. spiralis), practically none were found for three other pairs (B. malayi–T. spiralis, C. remanei–B. malayi, and C. remanei–T. spiralis), and very few were found for two other pairs (C. briggsae–B. malayi and C. briggsae–T. spiralis). The only pairs for which thousands of CNEs were found were the pairwise comparisons between the three Caenorhabditis species: elegans, briggsae, and remanei. These counts are more in keeping with the numbers expected on the basis of past studies.

The pairs for which no CNEs were found at all, using Vavouri et al.'s methodology and their levels of sensitivity for detecting conserved regions, were the ones between the best annotated species (C. elegans) and the species that diverged earliest from the others (B. malayi and T. spiralis). This finding may indicate that the few CNEs that were found in other pairwise comparisons with B. malayi or T. spiralis are the result of poorly annotated genomes where some coding regions have not yet been removed as thoroughly as they were for C. elegans.

The big-bang hypothesis that this study set out to test ("CNEs emerged once, several hundred million years ago") has thus been dealt a severe blow, because the very first look at pairwise CNEs seems to indicate that no or very few conserved non-coding elements are found when comparing nematode species that diverged more than 100 MYA. This is a strong claim and is examined in more detail in the Discussion chapter.

To be sure that the megablast heuristic was not accidentally leaving some conserved regions out, all the CNEs found in pairwise comparisons between the three Caenorhabditis species were blasted against the B. malayi and T. spiralis genomes (using the program blastn with an e-value threshold and word seed size 30), and no hits were returned.

The next two parts of this section report the CNEs found at less stringent levels of sensitivity, with some additional steps to ensure that all coding regions were identified as thoroughly as possible.

CNEs found after additional steps to remove coding regions

Three additional steps were performed to identify and remove coding regions in the conserved portions of pairwise comparisons. For details, see the section on removing coding regions in the Methods and Materials chapter. Performing the additional steps (removing putative CNEs which had hits below the e-value threshold in blastx searches against NemPep3 and UniRef90, and in a tblastx search against the coding regions of the well annotated C. elegans genome) gives Table 3.2. The CNE


More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Computational Structural Bioinformatics

Computational Structural Bioinformatics Computational Structural Bioinformatics ECS129 Instructor: Patrice Koehl http://koehllab.genomecenter.ucdavis.edu/teaching/ecs129 koehl@cs.ucdavis.edu Learning curve Math / CS Biology/ Chemistry Pre-requisite

More information

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST Introduction Bioinformatics is a powerful tool which can be used to determine evolutionary relationships and

More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information
