Evolutionary dynamics of conserved non-coding DNA elements: Big bang or gradual accretion? Sujai Kumar


Evolutionary dynamics of conserved non-coding DNA elements: Big bang or gradual accretion?
Sujai Kumar
Master of Science
School of Informatics
University of Edinburgh
2007

Abstract

Background

Previous studies have found that DNA elements are highly conserved in species from the same lineage, even though they do not code for proteins or RNA. One proposed function of such conserved non-coding elements (CNEs) is that they are cis-regulatory sequences for developmental genes which act as an abstraction of genetic regulatory networks, thus allowing new animal body plans to be specified in a modular way. This thesis tests the specific proposal by a previous study that CNEs arose in a big bang in the Precambrian, approximately 600 million years ago.

Results

The evolutionary dynamics of CNEs were studied by first identifying the elements, and then examining their levels of identity over time. Pairwise comparative sequence analysis of five contemporary nematode species provided a window into the past because these species diverged at different points of time over the last approximately 700 million years. The number of CNEs and their basic properties for the three most recently diverged species match the results obtained by other researchers, although no clear trend is visible in the change in identity of CNEs with respect to time since divergence. On adding two more species to the analysis, it was found that no such elements could be identified for species pairs with deep divergences.

Conclusions

The absence of CNEs for pairwise comparisons of species that diverged earliest indicates that CNEs did not arise in a big bang. CNEs that were found for the three Caenorhabditis species that diverged relatively recently (approximately 100 million years ago) seem to be specific only to that clade. However, the big bang hypothesis cannot be conclusively discarded because it is possible that the elements exist, but are short, or have multiple components spread across the genome, and are therefore difficult to detect. Missing CNEs could therefore be a limitation of computational approaches to discovering CNEs, and this study also suggests some ways to overcome those limitations.

Acknowledgements

I am very grateful to Alasdair Anthony and Ann Hedley at the Institute of Evolutionary Biology for getting me started on the mechanics of this project. Other members of the lab group also patiently heard my semi-formed thoughts on conserved non-coding elements, asked penetrating questions, and offered useful advice from time to time. Many thanks are also due to Douglas Armstrong for providing access to excellent computing resources and for helping with all administrative aspects of the MSc course. Most importantly, I would like to thank Mark Blaxter whose enthusiasm for life and all living things is contagious. He was the inspiration behind this project and provided much advice, encouragement, and pizza over the course of the summer.

Contents

1 Introduction
   Conserved Non-coding Elements (CNEs)
      Conservation of DNA
      Non-coding regions of the genome
      Properties and proposed functionality of CNEs
   Hypothesis and approach
   Scope
   Structure
2 Methods and Materials
   Obtaining genome sequences
   Finding CNEs
      Finding conserved portions
      Removing coding regions
   Determining CNE similarities
3 Results
   CNE counts
      CNEs found using methodology and parameters from Vavouri et al.
      CNEs found after additional steps to remove coding regions
      CNEs found for higher sensitivity levels
   CNEs shared across all pairs of species
   CNEs for C. elegans, C. briggsae, and C. remanei
   Aggregate properties of CNEs
4 Discussion
   Rejection of big bang hypothesis
   Limitations of current study, and future work
Appendix: Coding regions in GFF files
Bibliography

List of Figures

1.1 Phylogenetic tree of five nematode species compared in this study (Caenorhabditis divergences from Stein et al., 2003; B. malayi and T. spiralis divergences from Vanfleteren et al., 1994)

1.2 Identity vs time plots to verify evolutionary dynamics of CNEs. A minimum 25% identity is expected in all cases (dotted line) because sequences are made up of only four bases: A, T, G, and C. Because background nucleotide concentrations are biased (e.g. lower G-C levels), the minimum level of identity would be higher than 25% (dot-dashed line)

2.1 Example fragment of a nucleotide FASTA file

2.2 Steps for finding CNEs

2.3 (a) Fragment of a megablast results file (column headers: q = query identifier, t = target database identifier, %id = percentage identity, len = length of alignment, mis = number of mismatches in alignment, gap = number of gaps, q_st = starting coordinate of query sequence, q_en = query end, t_st = target start, t_en = target end, e = expect value, bit = bit score) and (b) the output of combine-mbl.pl for that fragment

2.4 Ten sample lines of GFF annotation file for C. briggsae Chromosome I. The highlighted entries depict coding regions. If the coordinates of a megablast result overlapped these coordinates, then it was discarded as a conserved coding region. Eventually, only putative conserved non-coding elements (CNEs) remained. See Appendix for the complete list

2.5 Fragment of coding region file with asterisks used to tag GFF source and feature combinations that specified a coding region

3.1 Characteristics of CNEs found for each pair (y axis), plotted against the time since divergence of the species in that pair (x axis): a) length, b) bit-score, and c) percentage identity. Because several pairs share the same time since divergence (such as C. elegans–C. briggsae, and C. elegans–C. remanei, both 100 MYA), this plot jitters the locations along the x-axis to make it easy to identify the data points for each pair

3.2 CNE percentage identity versus length, visualized as a scatterplot and as a 3D histogram, for C. briggsae–C. remanei, C. elegans–C. briggsae, and C. elegans–C. remanei

List of Tables

1.1 Level of conservation of CNEs in different groups of species

2.1 Sources for whole genome sequences

3.1 CNEs found for all ten pairs (in alphabetical order) of five nematode species using the method and parameters from Vavouri et al. (2007) for finding CNEs (with megablast parameters -W 30, -e 0.001)

3.2 CNEs found after additional checks to determine coding regions (-W 30, -e 0.001)

3.3 CNEs found using different sensitivity parameter settings

3.4 Results of clustering CNEs found at different sensitivity levels

3.5 Comparisons of mean length, bit-scores, and percentage identity for CNEs shared in two comparisons: C. briggsae–C. remanei (diverged 80 MYA), and C. elegans–C. briggsae (diverged 100 MYA)

3.6 Comparisons of mean length, bit-scores, and percentage identity for CNEs shared in two comparisons: C. briggsae–C. remanei (diverged 80 MYA), and C. elegans–C. remanei (diverged 100 MYA)

Chapter 1 Introduction

The increasing availability of full genome sequences has led to many comparative studies that have examined the non-protein-coding part of genomes. Over the last decade, several non-coding elements have been found that are completely conserved or conserved with a high degree of identity in species as diverse as Homo sapiens and Fugu rubripes (the Japanese pufferfish), which last shared a common ancestor approximately 450 million years ago (MYA). This level of conservation indicates that such sequences are functional even though they do not code for proteins or RNA. The real function of such elements remains an open question.

Understanding the evolution of non-coding DNA has the potential to address questions such as how genomes evolved and how they are still evolving. More importantly, it gives us a way to attempt to answer fundamental questions such as how the incredible complexity of life came to be.

This thesis analyses the evolutionary dynamics of conserved non-coding elements (CNEs). It builds on the foundation laid by Vavouri et al. (2007), who proposed that CNEs are regulatory elements for developmental genes and that it was the "rewiring" of CNEs that led to the evolution of the vast diversity of animal body plans. In their study, they compared the genomes of three species from the phylum Nematoda: Caenorhabditis elegans, Caenorhabditis briggsae, and Caenorhabditis remanei. Their analysis is replicated here, and two additional species from the same phylum, for which full genome sequences have recently become available, were added: Brugia malayi and Trichinella spiralis. These five species last shared a common ancestor more than 600 MYA, and the comparative analysis in this thesis helps answer whether CNEs arose only once (in a "big bang") in the Precambrian as proposed by Vavouri et al., or emerged gradually through evolutionary history.

Answering this question would provide us with insights into the process of evolution, may allow us to better understand how animal body plans are specified, and let us speculate whether another explosion in species diversity (of the kind seen 600 MYA) is possible in the future.

To test the big bang hypothesis, the main analytical method employed in this thesis was to compare the level of similarity between such elements for species that diverged at different times. In Section 1.1, the current understanding of CNEs is reviewed as background material for understanding the hypothesis. The hypothesis and the approach used to test it are presented in detail in Section 1.2. In the last parts of this introductory chapter (Sections 1.3 and 1.4), the scope and structure of the remaining chapters of this thesis are presented.

1.1 Conserved Non-coding Elements (CNEs)

Conserved non-coding elements (CNEs) are a recent discovery in several cross-species comparisons. Research groups have not yet decided on a common term for them and each has its own acronym for such sequences:

CNE - Conserved Non-coding Element (Vavouri et al., 2007, 2006, Woolfe et al., 2005)
UCR - Ultra-Conserved non-coding Region (Sandelin et al., 2004)
MCS - Multi-species Conserved Sequence (Margulies et al., 2003)
CNG - Conserved Non-Genic sequence (Dermitzakis et al., 2005)
HCE - Highly Conserved Element (Siepel et al., 2005)

This thesis uses the term CNE because it builds on the research and claims made by Vavouri et al. (2007, 2006) and Woolfe et al. (2005), and because the term captures both key aspects of such sequences: that they are conserved across species, and that they do not code for proteins or RNA.

1.1.1 Conservation of DNA

Conserved DNA sequences are interesting because they indicate that the sequences are functional. Non-functional sections of the genome undergo mutation and drift apart as species diverge from each other. Functional sections of the genome remain recognisably similar over long time periods because they code for proteins, code for RNA, are structural, or act as regulatory sites for enhancers, promoters, repressors, and so on. If a section of DNA has such a functional role, it will be under purifying selection and is likely to remain the same or similar over millions of years of mutation pressure. This is the key idea behind all comparative analyses across species, and is a way of identifying functional parts of the genome.

1.1.2 Non-coding regions of the genome

Historically, protein-coding genes were the focus of genome studies (Bird et al., 2006) and a sequence was considered interesting only if it coded for a protein or RNA. As better experimental and informatics technologies were developed, it was discovered that protein and RNA coding genes only account for small proportions of the whole genome (1.5% to 25% in animal genomes). Comparative genomics studies have identified non-coding regions that appear to be highly conserved, and although some parts are now understood to form a complex interacting network of regulatory elements, the functionality of other non-coding parts remains unknown.

1.1.3 Properties and proposed functionality of CNEs

CNEs have been identified for groups of vertebrates and invertebrates separately. Although no sequence identity has been discovered so far between CNEs in vertebrates and CNEs in invertebrates, they share characteristics such as:

High levels of identity (higher than that of protein-coding genes in most cases), across a wide range of species: Table 1.1 summarizes the level of conservation of CNEs for different groups of species. Although the data are from different sources and do not use the same measures of identity, the figures provide a general idea of how much these sequences are conserved even in the case of species that diverged approximately 450 MYA.

Clustering around genes: The density of CNEs is higher in gene-rich regions in humans (Bejerano et al., 2004, Sandelin et al., 2004, Woolfe et al., 2005) and nematodes (Vavouri et al., 2007), with several CNEs clustered around each gene.

Association with developmental genes: Gene association is determined by looking for the transcription start site nearest to each CNE. CNE-associated genes seem to be enriched for regulators of development such as transcription factors and signalling genes (Sandelin et al., 2004, McEwen et al., 2006).

CNEs also exhibit other interesting properties that are not yet understood, such as a spike in AT frequency just inside CNE boundaries (in sharp contrast to flanking regions, Vavouri et al., 2007), and AT frequencies that are similar (~65%) across species despite the background AT content of each genome being different. Based on the properties listed above (high identity levels, association with developmental genes, specificity to phyla) and on experiments testing the functionality of

CNEs, the most likely function of CNEs is that they are cis-elements that regulate the transcription of a core set of developmental regulatory genes in each species. Cis-elements are regions of DNA that lie on the same DNA molecule as the gene they regulate. Genetic regulatory networks (GRNs) use cis-elements extensively to regulate the complex production of proteins with the help of biological controls such as signal transducers, switches, feedback loops, feedforward loops, and combinatorial functions such as "and" and "or" relationships (Andrianantoandro et al., 2006).

Table 1.1: Level of conservation of CNEs in different groups of species

Species compared | Number of CNEs and level of identity | Last common ancestor
Fruit flies (Glazov et al., 2005): Drosophila melanogaster, Drosophila pseudoobscura | elements; 100% identity; length > 50bp | MYA
Mammals (Bejerano et al., 2004): Homo sapiens (Human), Mus musculus (Mouse), Rattus norvegicus (Rat) | 256 elements; 100% identity; length > 200bp | 55 MYA
Nematodes (Vavouri et al., 2007): Caenorhabditis elegans, Caenorhabditis briggsae, Caenorhabditis remanei | 2084 elements; megablast word seed size 30bp (W30) with e-value threshold 0.001; average length 69bp | 100 MYA
Vertebrates (Woolfe et al., 2005): Homo sapiens (Human), Takifugu rubripes (Pufferfish) | 1373 elements; 84% identity; average length ~200bp | 450 MYA

Development of the animal body plan is controlled by large GRNs, and changes in core developmental GRNs can result in new animal body plans (Davidson and Erwin, 2006). Vavouri et al. (2007) proposed that the initial emergence and subsequent modification of CNEs associated with GRNs was responsible for the evolution of new animal body plans. According to this theory, each animal group has a different set of CNEs because the core GRN for that animal group evolved as a result of the rewiring of CNEs.

The vast diversity of body plans first seen in the fossil record of the Cambrian indicates that an evolutionary explosion started in the Precambrian, and it is possible that the non-coding elements conserved in modern species are hard-wired traces of the changes that took place in core developmental regulatory networks. Because CNEs were highly conserved across species in the same family (e.g., mammals, Dermitzakis et al., 2005) and showed no conservation at all across species from different families (e.g., between humans and nematodes), it is plausible that CNEs are linked in this way to the specification of body plans.

The claim that CNEs arose in a big bang in a short period of time around the Precambrian and are responsible for the profusion of animal body plans at that time is interesting and testable. If the claim is true, it should be possible to identify the same CNEs in all branches of a phylum. The alternative is that CNEs arise from time to time, and that new CNEs can be seen in every branch of the phylogenetic tree.

Figure 1.1: Phylogenetic tree of five nematode species compared in this study (Caenorhabditis divergences from Stein et al., 2003; B. malayi and T. spiralis divergences from Vanfleteren et al., 1994)

The next subsection frames this hypothesis as a specific question, and outlines the approach taken in this study to determine the evolutionary dynamics of CNEs.

1.2 Hypothesis and approach

The main goal of this study is to find evidence for or against the idea that CNEs arose in a big bang once, several hundred million years ago. The overall strategy for doing this is to compare genomes from the same phylum that diverged at different points of time over the last half a billion years, find CNEs in these genomes, and see how their levels of identity have changed over time.

To study the question of how the CNE identities have changed, species from a single phylum are needed because CNEs are not conserved across different phyla. The phylum Nematoda is ideally suited for this purpose because five species within this phylum have recently been sequenced completely, and the approximate dates of divergence for these species span a long time period (from 80 to 700 MYA): Caenorhabditis elegans, Caenorhabditis briggsae, Caenorhabditis remanei, Brugia malayi, and Trichinella spiralis. The dates in Figure 1.1 are estimates that have an error of ±20 million years for the three Caenorhabditis species, and the error increases to ±100 million years for the branch points of B. malayi and T. spiralis.

A CNE is found by comparing the genomes of two species, identifying the conserved elements, and removing those elements that are known to code for proteins or RNA. Thus, CNEs are defined for pairs of species for the purposes of this study, and their level of identity for each pair can be determined. Although CNEs are generally conserved with high levels of identity compared to protein-coding sequences, a pair of species that diverged relatively recently (e.g., C. briggsae and C. remanei, 80 MYA) would be expected to share CNEs with an even higher level of identity than a pair that diverged much earlier (e.g., C. elegans and B. malayi, 500 MYA).
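Identity levels like these have a floor set by chance alone, which the caption of Figure 1.2 gives as 25% for unbiased sequences and somewhat higher for biased base composition. As a rough illustration (not part of the thesis, whose programs were Perl scripts), the Python sketch below computes the probability that two independently drawn bases match, which is the sum of squared base frequencies. The AT-rich composition used here is a hypothetical example, and the calculation ignores the fact that real alignments are optimized and therefore tend to score above this floor.

```python
def chance_identity(freqs):
    """Probability that two independently drawn bases match: sum of p_i squared."""
    total = sum(freqs.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("base frequencies must sum to 1")
    return sum(p * p for p in freqs.values())


# Unbiased composition: each base is equally likely.
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Hypothetical AT-rich composition (65% A+T), chosen only for illustration.
at_rich = {"A": 0.325, "T": 0.325, "G": 0.175, "C": 0.175}

print(f"uniform floor: {chance_identity(uniform):.4f}")
print(f"AT-rich floor: {chance_identity(at_rich):.4f}")
```

Under these assumptions the unbiased floor is exactly one quarter, and the AT-rich floor is a little above 27%, consistent with the caption's point that compositional bias raises the minimum expected identity.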

Figure 1.2: Identity vs time plots to verify evolutionary dynamics of CNEs: (a) supporting big bang model, (b) supporting gradual accretion model (x-axis: time since divergence in MYA; y-axis: identity). A minimum 25% identity is expected in all cases (dotted line) because sequences are made up of only four bases: A, T, G, and C. Because background nucleotide concentrations are biased (e.g. lower G-C levels), the minimum level of identity would be higher than 25% (dot-dashed line).

If CNEs arose in the phylum Nematoda in a big bang, then at least some of the CNEs should be present in all ten pairwise comparisons between the five species, and one would expect the level of similarity for each pair to decrease with the time since divergence of that pair. For each CNE, if a plot was created where the x-axis represented time since divergence and the y-axis represented level of similarity (percentage identity), then the plot might look something like Figure 1.2a if the CNE had arisen only once, in a big bang. On the other hand, if some CNEs are gradually recruited to the genome, then one would expect to see an identity vs time plot as in Figure 1.2b. That is, no corresponding CNEs would exist for pairs that diverged more than a certain number of years ago.

The aim of this thesis is to find CNEs for each of the pairwise comparisons of the five nematode species, and to determine how the identity or similarity of CNEs depends on the time since the pair diverged. Examining the pairwise identities for each CNE would provide evidence for or against the big bang theory of CNE emergence.

It is also possible that some CNEs arose in a big bang during the Precambrian, whereas others arose much later on the evolutionary timeline. Davidson and Erwin (2006)

point out that animal GRNs have levels of hierarchy. Those CNEs associated with the core or kernel GRNs might show big bang features (Fig. 1.2a) because they were responsible for specifying the nematode body plan, whereas those associated with peripheral gene networks might have arisen relatively recently within the individual branches of the phylogenetic tree (Fig. 1.2b).

1.3 Scope

The analysis reported in this thesis draws extensively on existing bioinformatics tools and databases, and several new programs were written to manage the process of finding CNEs and analysing their levels of identity for ten pairs of species. These programs were designed to be very efficient because genome annotation files for well annotated species can be several gigabytes in size and need to be searched rapidly to decide if a particular sequence is coding or non-coding.

Previous research on nematode CNEs had concentrated only on the three Caenorhabditis species: elegans, briggsae, and remanei. The complete genome sequences of Brugia malayi and Trichinella spiralis have only recently become available and this is the first study to look at conserved non-coding elements in all five species at the same time.

Although past studies have explored other properties of CNEs (such as their AT frequency, gene association, etc.), this study concentrates only on the level of identity or similarity of the CNEs found in each pair of species, because the goal is to look for evidence for or against the big bang hypothesis of CNE emergence.

1.4 Structure

The rest of this thesis is organized as follows. Chapter 2 describes the Methods and Materials used to carry out the study. This part includes details on all the steps used to find CNEs, ranging from descriptions of the data sources to the optimizations carried out to speed up the process of discovering coding regions that had to be eliminated to discover CNEs.

Chapter 3 presents the results, beginning with a summary of the numbers of CNEs found for each pair at different sensitivity levels. The CNEs found are then clustered to determine which CNEs are shared across multiple pairwise comparisons. An analysis follows, describing the trends in identity for several thousand CNEs found in the three Caenorhabditis comparisons. The last part of Chapter 3 describes some aggregate properties of the CNEs found.

In the Discussion (Chapter 4), the results are summarized in the context of the original hypothesis. Although the evidence points to a rejection of the hypothesis, the hypothesis cannot be discarded with certainty for several reasons that are presented in detail. The limitations of this study and suggestions for future work complete the thesis.
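The ten pairwise comparisons referred to throughout this chapter follow from choosing two of the five species at a time. A minimal Python sketch of that enumeration is given below; it is an illustration only, since the thesis's own programs were Perl scripts. The divergence times included are just the ones quoted in the text so far, and the remaining pairs are left unset rather than guessed.

```python
from itertools import combinations

SPECIES = ["C. elegans", "C. briggsae", "C. remanei", "B. malayi", "T. spiralis"]

# Divergence times (MYA) that are quoted in the text; other pairs are
# deliberately left out here and should be read from Figure 1.1.
KNOWN_DIVERGENCE_MYA = {
    frozenset(["C. briggsae", "C. remanei"]): 80,
    frozenset(["C. elegans", "C. briggsae"]): 100,
    frozenset(["C. elegans", "C. remanei"]): 100,
    frozenset(["C. elegans", "B. malayi"]): 500,
}


def species_pairs(species):
    """Return every unordered pair of species, in input order."""
    return list(combinations(species, 2))


pairs = species_pairs(SPECIES)
for first, second in pairs:
    mya = KNOWN_DIVERGENCE_MYA.get(frozenset([first, second]))
    label = f"{mya} MYA" if mya is not None else "see Figure 1.1"
    print(f"{first} vs {second}: {label}")
```

Using a frozenset as the dictionary key makes the lookup independent of the order in which the two species are named.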

Chapter 2 Methods and Materials

The following three steps broadly describe the method for testing the hypothesis that CNEs arose in a big bang:

1. Obtain complete genome sequences for the five species being considered
2. Find CNEs
   (a) Find the conserved portions for each possible pair of species
   (b) Out of the conserved portions, remove those that overlap known coding regions to identify CNEs
3. Determine CNE similarities for pairs of species that diverged at different points of time

Each step presented its own set of challenges and several choices had to be made for each of the steps above. The challenges, decisions, and the reasons for those decisions are described in the next three subsections. Several Perl scripts were written for processing the data at each step. These programs are described in this chapter and the source code for the programs is available at

2.1 Obtaining genome sequences

Genome sequences for each of the five species were obtained in FASTA format (Pearson and Lipman, 1988) from the sources shown in Table 2.1. FASTA files are a standard

way of storing and sharing sequence information in text files, and typically consist of a header line that contains the sequence identifier, followed by a series of letters that denote nucleotides (in the case of DNA or RNA sequences) or amino acids (in the case of protein sequences). Figure 2.1 shows the first 200 bases of the X chromosome of C. elegans in FASTA format.

Table 2.1: Sources for whole genome sequences.

Species | Source | Notes
C. elegans | The C. elegans Sequencing Consortium (1998), sequence obtained from WormBase (Bieri et al., 2007) | Release WS170
C. briggsae | Stein et al. (2003), sequence obtained from WormBase (Bieri et al., 2007) | Release CB3
C. remanei | GSC (2007a), sequence obtained from WormBase (Bieri et al., 2007) | Assembly
B. malayi | Ghedin et al. (2007), sequence obtained from Blaxter (2007) |
T. spiralis | GSC (2007b) | Release 1.0

Figure 2.1: Example fragment of a nucleotide FASTA file

C. elegans and C. briggsae are the most extensively studied of these five species and thus had the most complete genomes (The C. elegans Sequencing Consortium, 1998, Stein et al., 2003), in which all the bases were assigned to chromosomes. The other genomes were only available organized as contigs. A contig is the contiguous consensus sequence derived from overlapping DNA fragments that have been sequenced. The precise chromosomal location of a contig is not known, but for the purpose of this study the base sequences were enough to discover conserved regions.

The latest releases available at the time of this study were used for the results reported here. However, while writing the programs to discover CNEs, older releases of C. elegans (WS140) and C. briggsae (CB25) were also downloaded from the FTP archives at WormBase (Bieri et al., 2007) to verify that the programs discovered exactly the same CNEs as Vavouri et al. (2007). For C. remanei, a newer assembly became available halfway through the project, but an earlier version was used because genome annotations were available for that version. B. malayi and T. spiralis had been sequenced most recently and thus only one version was available for those species.

For the three Caenorhabditis species, annotation files that specified the function of known parts of the genome were available for each release, and were also downloaded from

WormBase. The format of these files is described in more detail in Section 2.2.2: Removing coding regions. Although no annotation files for B. malayi or T. spiralis were publicly available at the time of this study, a FASTA file with all the known coding sequences for B. malayi was obtained from Blaxter (2007).

2.2 Finding CNEs

Vavouri et al.'s (2007) methodology was used as the starting point for identifying CNEs. Their procedure for a pair of species was repeated in this study for ten pairs (all possible pairs for five species, as in Table 3.1). The steps and parameters were initially kept identical to verify that the programs for this project were finding CNEs the same way. Subsequently, new parameter sets were tried, which are described in Chapter 3.

The overall process to find CNEs between two species was to first find the conserved portions (the parts that are recognizably similar) and then to remove those parts that overlap a known coding region. In each of the pairs in the first column of Table 3.1, the first species in the pair had better annotations, and was used as the reference against which coding regions were found and removed.

All the programs ending in .pl in Figure 2.2 were developed during the course of this thesis for finding CNEs. These programs are described in the next few subsections (highlighted in italics) along with other publicly available programs that were used at each stage of the process to find CNEs for each pair of species. Additionally, an overall script pipeline.pl was written that called these programs with the appropriate parameters for each pair of species.

2.2.1 Finding conserved portions

To identify conserved portions between two species, the program megablast (Zhang et al., 2000) was used. Megablast takes a query sequence and compares it against a target database to find all the subsequences of the query that have a hit (match) against the database. The program uses a heuristic that is much faster than the dynamic programming algorithm for finding sequence alignments (Smith and Waterman, 1981). Although megablast is theoretically not guaranteed to find all the alignments between two sequences, in practice it almost always finds the best alignments.

For this thesis, all the parameters for megablast were kept at their default values, except the word seed size (-W, which specifies the number of contiguous nucleotides that must be identical in an alignment) and the e-value threshold (-e, which is a statistical estimate of how often that alignment is likely to occur by chance in a target database

Figure 2.2: Steps for finding CNEs

of that size). The initial megablast parameters (-W 30 and -e 0.001) were the same as those of Vavouri et al. (2007), though lower word seed sizes and less stringent e-value thresholds were also tried, as reported in the Results chapter. This set of parameters was not very sensitive, and alignments shorter than 30 nucleotides were missed by definition (and some that were longer than 30 were also missed because alignments can have gaps). In comparison, Woolfe et al. (2005) used -W 20 when they found CNEs between humans and pufferfish.

Figure 2.3: (a) Fragment of a megablast results file (column headers: q = query identifier, t = target database identifier, %id = percentage identity, len = length of alignment, mis = number of mismatches in alignment, gap = number of gaps, q_st = starting coordinate of query sequence, q_en = query end, t_st = target start, t_en = target end, e = expect value, bit = bit score) and (b) the output of combine-mbl.pl for that fragment.

The tabular format was chosen for hits returned by megablast, and the better annotated genome was used as the query sequence in all the pairwise comparisons. Once megablast had run, overlapping hits on the query sequence were combined using combine-mbl.pl because it was assumed that two adjacent or overlapping hits represent the same CNE. Figure 2.3 provides an example of how combine-mbl.pl works. The query identifier remained the same in the combined megablast result, but the target database identifier was left blank because the hits could have been with different parts of the target database. Target start and end coordinates were also left out for the same reason. The percentage identity and the bit-scores of the combined result were determined by taking the lowest values out of the results that were combined (Dubchak et al., 2000).

Finding conserved regions in this way was not symmetric because overlapping regions were combined only for the better annotated genome. However, this was a reasonable simplification because the better annotated genome was the one on the basis of which overlaps with coding regions were determined, as described in the next subsection. The output of combine-mbl.pl was the starting collection of conserved elements (Fig. 2.2). The steps for removing coding regions from this collection are described next.
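The combining step described above can be illustrated with a short Python sketch. This is not the source of combine-mbl.pl, which was a Perl script that is not reproduced here; it only follows the behaviour described in the text, namely merging overlapping or adjacent hits on the query and keeping the lowest percentage identity and bit-score. The `Hit` class, the function names, the `max_gap=1` reading of "adjacent" (hits that abut with no bases between them), and the sample lines are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    query: str
    pct_id: float
    q_start: int
    q_end: int
    bit: float


def parse_tabular_line(line):
    """Parse one 12-column tabular hit (column order as in the Figure 2.3 caption)."""
    f = line.split()
    q_st, q_en = int(f[6]), int(f[7])
    return Hit(f[0], float(f[2]), min(q_st, q_en), max(q_st, q_en), float(f[11]))


def combine_hits(hits, max_gap=1):
    """Merge hits on the same query that overlap or directly abut each other."""
    merged = []
    for h in sorted(hits, key=lambda h: (h.query, h.q_start, h.q_end)):
        last = merged[-1] if merged else None
        if last and last.query == h.query and h.q_start <= last.q_end + max_gap:
            last.q_end = max(last.q_end, h.q_end)
            # Keep the lowest identity and bit-score of the combined hits.
            last.pct_id = min(last.pct_id, h.pct_id)
            last.bit = min(last.bit, h.bit)
        else:
            merged.append(Hit(h.query, h.pct_id, h.q_start, h.q_end, h.bit))
    return merged


# Hypothetical hits, invented for this example (not taken from Figure 2.3).
SAMPLE = [
    "CHROMOSOME_I chrI 100.00 60 0 0 1000 1059 5000 5059 1e-25 119",
    "CHROMOSOME_I chrIV 96.00 50 2 0 1040 1089 200 249 1e-15 83.8",
    "CHROMOSOME_I chrII 100.00 40 0 0 3000 3039 700 739 1e-12 79.8",
]

merged = combine_hits([parse_tabular_line(line) for line in SAMPLE])
for m in merged:
    print(m.query, m.q_start, m.q_end, m.pct_id, m.bit)
```

In this sketch the first two sample hits overlap on the query and collapse into one element, while the third stays separate. Target identifiers and coordinates are dropped, matching the description that they were left blank in the combined result.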

Figure 2.4: Ten sample lines of the GFF annotation file for C. briggsae Chromosome I. The highlighted entries depict coding regions. If the coordinates of a megablast result overlapped these coordinates, then it was discarded as a conserved coding region. Eventually, only putative conserved non-coding elements (CNEs) remained. See Appendix for the complete list.

Removing coding regions

Continuing with the process developed by Vavouri et al., only those conserved portions for each species pair that did not overlap any known coding regions were retained. Identification of the coding regions was a multi-step process that included checking genome annotations, looking for transfer RNA (tRNA) coding regions, and matching against known expressed sequence tags (ESTs) for that species. Additionally, low-complexity repeats in the genome and known elegans repeats were marked and removed with the help of the RepeatMasker software package (Smit et al.).

Filtering megablast results against genome annotations

The most important step in deciding whether a conserved segment overlapped a coding region was to check it against the genome annotation. Annotations in the General Feature Format (GFF) were downloaded from WormBase (Bieri et al., 2007) for C. elegans, C. briggsae, and C. remanei. These three annotation GFFs were sufficient for checking 9 of the 10 pairwise comparisons, but no GFF was available for B. malayi, so the B. malayi–T. spiralis pair was processed differently, as described later in this section.

The GFF file fragment in Figure 2.4 lists several fields, but the important ones for this project were the first five: seqname, source, feature, start, and end. Each line represents a feature at a particular location on the genome. seqname was used to identify the chromosome or contig for which the annotation was provided, and the next two fields were used to determine whether that annotation was for a coding region.
The start and end fields mark the coordinates of that feature on that chromosome or contig. The Appendix lists the source and feature combinations found in the three GFF files for C. elegans, C. briggsae, and C. remanei. This list was examined manually and features

that referred to coding regions were tagged with an asterisk (Blaxter, 2007; Vavouri et al., 2007) and stored in a tab-separated file (Fig. 2.5).

Figure 2.5: Fragment of coding region file with asterisks used to tag GFF source and feature combinations that specified a coding region.

This tab-separated file for identifying exons was then used by the program clean-sort-gff.pl to pull out all the lines in the GFF file that referred to coding regions. clean-sort-gff.pl combined the coordinates of the coding regions (if they overlapped) and wrote out only the start and end coordinates of each combined region to a file that was created for each chromosome or contig referred to in the GFF file.

Preprocessing the GFF file in this way into a sorted, non-overlapping set of start and end coordinate pairs for each known coding region was a major optimization. The simplified coding region coordinate file became short enough to be loaded into memory and binary searched to see whether a megablast result overlapped a coding region. Whereas naive code for checking each megablast result against the entire GFF annotation file took almost 15 hours on a high-end workstation, this optimization sped up the process by a factor of almost 20,000 (e.g., 36,000 megablast results for the C. elegans–C. briggsae pairwise comparison could be checked against the C. elegans GFF with 15 million records in less than 3 seconds).

The B. malayi–T. spiralis pair of species was tackled differently as no GFF annotation was publicly available for the B. malayi genome.
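The GFF preprocessing and binary search used for the annotated species can be sketched in Python as follows. This is an illustration of the idea only, not the original clean-sort-gff.pl: the function names merge_intervals and overlaps_coding are hypothetical, coordinates are assumed to be inclusive, and one list of intervals is assumed per chromosome or contig.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Return sorted, non-overlapping (start, end) pairs covering the input."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous coding region: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps_coding(merged, starts, q_start, q_end):
    """True if the inclusive range [q_start, q_end] overlaps a coding region.

    `merged` must come from merge_intervals and `starts` must be the list of
    its start coordinates, so that both are sorted.
    """
    # Last coding region that starts at or before the end of the query range.
    i = bisect_right(starts, q_end) - 1
    # Because regions are sorted and disjoint, their ends are sorted too, so
    # only this one region needs to be inspected.
    return i >= 0 and merged[i][1] >= q_start
```

Each lookup inspects a single candidate region after a logarithmic search, instead of scanning every annotation record, which is the kind of saving the text attributes to this preprocessing.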
The list of conserved elements after the megablast step was converted to a FASTA file (using mbl2fasta.pl, described in more detail in the next section), and this FASTA file was blasted (i.e., the program blastn (Altschul et al., 1997) was used to find matches between the two sets of sequences) against a database of known B. malayi coding sequences, also in FASTA format. The conserved regions in the B. malayi–T. spiralis pair that matched the B. malayi coding sequences below the e-value threshold used were removed, leaving a set of putative CNEs for this pair.

Converting filtered megablast results to FASTA

In the previous step, megablast results for each pair of species were checked against GFF annotations and those that overlapped coding regions were removed. The putative CNEs were still in megablast output format (specified as a chromosome or contig location, along with the start and stop coordinates). The next set of steps required the putative CNEs to be in FASTA format so that the actual nucleotides could be checked to remove additional coding regions that were missed by the GFF checking step. The program mbl2fasta.pl converted each putative CNE in megablast output format to a FASTA sequence by looking up the appropriate genome sequence file, finding the right chromosome or contig, and pulling out the nucleotides from the start to the stop coordinate. CNEs for the B. malayi–T. spiralis pair had already been converted into FASTA format in the previous step, so mbl2fasta.pl was not run for that pair.

Using RepeatMasker to scan for simple or known elegans repeats

RepeatMasker (Smit et al.) is a program that finds interspersed repeats and low-complexity DNA sequences (such as ATATAT...) and masks these repeats by replacing the repeated nucleotides with a series of Ns. Simple repeats like these are very frequent in the genomes of all species and can lead to uninformative alignments or matches when looking for conserved sequences. RepeatMasker was run on the putative CNEs from the previous step at the slowest, most sensitive setting, using cross_match (Ewing and Green, 1998) as the comparison engine and the RepBase library (Jurka et al., 2005) of known repeats for C. elegans. The program remove-rm.pl was then used to remove all CNEs that contained more than 80% repeats. RepeatMasker could have been run on each genome before megablast was used to find conserved regions, but masking repeats in large genomes takes a long time, so it was more efficient to first run megablast (a very fast algorithm) and then run RepeatMasker only on the conserved regions found.
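The 80% repeat filter can be sketched as below. This is a minimal Python illustration rather than the original remove-rm.pl: the function names are hypothetical, the CNEs are assumed to be held as a dictionary from identifier to masked sequence, and every N in a masked sequence is assumed to come from RepeatMasker rather than from an unknown base in the genome assembly.

```python
def masked_fraction(seq):
    """Fraction of positions in a masked sequence that are N."""
    if not seq:
        return 0.0
    return seq.upper().count("N") / len(seq)


def remove_repeat_rich(cnes, max_fraction=0.8):
    """Keep CNEs whose masked fraction does not exceed max_fraction.

    `cnes` maps a CNE identifier to its RepeatMasker-masked sequence.
    A CNE is dropped only when more than 80% of it is masked (default).
    """
    return {name: seq for name, seq in cnes.items()
            if masked_fraction(seq) <= max_fraction}
```

For example, a ten-base sequence with nine masked positions would be removed, while one with exactly eight masked positions would be kept under this reading of "more than 80%".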
Using tRNAscan to scan for tRNA coding regions

Once simple repeats had been removed from the putative CNEs, tRNAscan (Lowe and Eddy, 1997) was run to find regions that matched known tRNA coding genes. Some of these regions had already been removed at the GFF checking stage because some of the better annotations included information on tRNA coding regions. As in the previous step, regions identified by tRNAscan as tRNA coding genes were removed using the program remove-trna.pl.

Removing known Rfam and miRNA regions

Continuing the process of filtering putative CNEs to remove all known coding regions, the next step was to check the Rfam and microRNA (miRNA) databases.

The Rfam database (Griffiths-Jones et al., 2005) has information about families of non-coding RNA and other structural RNA elements, and the miRNA database (Griffiths-Jones, 2004) contains predicted hairpin portions of miRNA transcripts. Putative CNEs were blasted against both these databases (using blastn with an e-value threshold set through -e) and any CNEs that showed hits in these two databases were removed using remove-blast-coding.pl.

Removing ESTs for poorly annotated species

The final step in Vavouri et al.'s method for discovering CNEs was to look for matches between the putative CNEs for a pair of species and the Expressed Sequence Tag (EST) databases for those species. An EST is a low-cost, low-quality sequence of nucleotides obtained by sequencing cloned mRNAs. Because ESTs are obtained from mRNAs, a match to an EST is a positive indicator of a coding region, even though it may not be a protein-coding gene. EST sequences were downloaded from EBI (Harte et al., 2004) for all the poorly annotated species in this study (i.e., all except C. elegans). As in the previous step, putative CNEs were blasted against the EST sequences for these species, and all sequences that had a match were removed using remove-blast-coding.pl.

Finding and removing matches against other databases and genomes

The steps described so far for removing coding regions from a set of conserved elements between two species were proposed by Vavouri et al. (2007). Additionally, Blaxter (2007) suggested blasting (i.e., using programs from NCBI's blast suite) the remaining CNEs against three other sequence sources to be sure that the CNEs obtained were non-coding:

blastx against NemPep3: NemPep3 (Wasmuth and Blaxter, 2006) is an exhaustive database of protein sequences from the phylum Nematoda. blastx was used to compare nucleotide sequences against the NemPep3 protein sequence database.
blastx against UniRef90: UniRef90 (Harte et al., 2004) is a non-redundant reference database of all proteins in the UniProt database. Protein sequences with at least 90% identity are clustered together in UniRef90.

tblastx against C. elegans genome (for pairs without C. elegans): tblastx compares a nucleotide query sequence against a nucleotide target database, after translating both sets of sequences in all six reading frames. Because C. elegans had the

best annotated genome, CNEs from pairs that were not checked against the C. elegans GFF were checked against the C. elegans genome to see if any of the CNEs matched known elegans coding regions.

These three searches addressed the same issue: looking for and removing known protein-coding regions (especially in nematodes) from the set of putative CNEs. In all three cases, the CNEs with blast hits below the e-value threshold were removed using remove-blast-coding.pl.

2.3 Determining CNE similarities

The megablast program used for finding conserved regions returns two measures of similarity for each sequence alignment that it finds:

Percentage identity is the most basic measure of similarity between two sequences and is defined as the percentage of total bases in the alignment that are identical between the two sequences.

Bit-score is based on the raw score obtained by summing the positive scores for each nucleotide match and the negative scores for each nucleotide mismatch or gap. The raw score is normalized to give the bit-score.

Percentage identity and bit-scores cannot be derived from the megablast results alone when two or more megablast results are combined (overlapping hits were combined to give a consolidated region according to the procedure of Vavouri et al., 2007). Therefore, based on Dubchak et al. (2000), the lowest percentage identity or lowest bit-score of a set of overlapping hits was taken to represent the overall percentage identity or bit-score of the combined region.

These two measures were used (along with the length of the CNE) to indicate the level of conservation of each pairwise CNE. However, the goal of this thesis was to see whether the level of conservation of a CNE changed for different pairs of species. Therefore, a way was needed to establish whether two CNEs from different pairwise comparisons represented the same canonical CNE. Blastclust (NCBI, 2007) was used to cluster the CNEs on the basis of their sequence similarities.
CNEs belonging to the same cluster had very similar sequences and, if they came from different pairs of species, the level of identity of each CNE could be compared based on the time since divergence of the species in that pair.
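The two similarity measures defined at the start of this section can be made concrete with a short Python sketch. It is illustrative only: the function names are hypothetical, alignments are assumed to be given as two equal-length strings with "-" for gaps, and the normalization shown is the standard Karlin-Altschul form, S' = (lambda * S - ln K) / ln 2, whose lambda and K values depend on the scoring system and are supplied by the caller here as placeholders.

```python
import math


def percent_identity(aligned_query, aligned_target):
    """Identical columns as a percentage of all alignment columns (gaps included)."""
    if len(aligned_query) != len(aligned_target):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_query:
        raise ValueError("alignment is empty")
    matches = sum(1 for q, t in zip(aligned_query, aligned_target)
                  if q == t and q != "-")
    return 100.0 * matches / len(aligned_query)


def bit_score(raw_score, lam, k):
    """Normalize a raw alignment score: S' = (lambda * S - ln K) / ln 2.

    `lam` and `k` are the statistical parameters of the scoring system;
    the values passed in are the caller's assumption, not megablast defaults.
    """
    return (lam * raw_score - math.log(k)) / math.log(2)
```

Because the bit-score is normalized in this way, scores from searches with different scoring systems can be compared, which the raw score alone does not allow.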

Mathematica 6.0 (Wolfram Research Inc, 2007) was used for all the statistical analyses and for creating plots of the levels of identity. Mathematica's list and set processing capabilities were especially useful in selecting clusters that contained CNEs from the desired pairs of species. The results of finding the CNEs, and of comparing their similarities across different species pairs based on how long ago each pair diverged, are reported in the next chapter.
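The cluster selection mentioned above was done in Mathematica; the same set-based idea can be sketched in Python. Everything named here is hypothetical: clusters are assumed to have been parsed already into lists of CNE identifiers, cne_pair is an assumed lookup from each identifier to the species pair it came from, and the pair labels are arbitrary strings.

```python
def clusters_with_pairs(clusters, cne_pair, required_pairs):
    """Return the clusters that contain at least one CNE from every required pair.

    clusters       -- list of clusters, each a list of CNE identifiers
    cne_pair       -- dict mapping a CNE identifier to its species-pair label
    required_pairs -- species-pair labels that must all be represented
    """
    required = set(required_pairs)
    selected = []
    for cluster in clusters:
        pairs_present = {cne_pair[cne] for cne in cluster}
        if required <= pairs_present:  # subset test: all required pairs present
            selected.append(cluster)
    return selected
```

Within a selected cluster, the identity of the member from each pair can then be set against the divergence time of that pair.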

Chapter 3 Results

This chapter presents the results of using the pipeline described in Chapter 2 and systematically varying the parameters for finding CNEs. The first part presents the counts of CNEs that were discovered for each pair of species at different levels of sensitivity, and the second part presents CNEs that were shared across multiple pairwise comparisons. One of the main goals of this study was to identify the shared CNEs and see how their level of identity changes depending on how long ago the species they came from diverged. Although a few CNEs were found for pairs of species that diverged furthest in the past, none of those CNEs were seen in other pairwise comparisons, so it is impossible to study how those CNEs have changed in identity over time.

The third part of the results concentrates on the properties of CNEs for the three species for which the most CNEs were found: C. elegans, C. briggsae, and C. remanei. Unfortunately, these three species diverged from each other relatively recently (100 MYA compared to 500 MYA for B. malayi and 700 MYA for T. spiralis), so they cannot be used to trace a long evolutionary history of CNEs. The fourth and final part describes some aggregate properties of the CNEs found in each pairwise comparison.

The results are clear: very few or no CNEs were found for pairs of species that diverged earliest, and none of the thousands of CNEs found for the three species that diverged most recently were shared beyond that group. This is the main finding of this study and is presented in the Discussion (Chapter 4) with more context, along with its implications.

Table 3.1: CNEs found for all ten pairs (in alphabetical order) of five nematode species using the method and parameters from Vavouri et al. (2007) for finding CNEs (with megablast parameters -W 30, -e 0.001).
Columns: Pair; megablast hits; Combine megablast hits; Filter against GFF of better annotated species; Remove repeatmasked; Remove tRNA; Remove Rfam, miRNA, and EST matches.
Rows: B. malayi–T. spiralis*; C. briggsae–B. malayi; C. briggsae–C. remanei; C. briggsae–T. spiralis; C. elegans–B. malayi; C. elegans–C. briggsae; C. elegans–C. remanei; C. elegans–T. spiralis; C. remanei–B. malayi; C. remanei–T. spiralis; All pairs.
*For B. malayi, no annotation GFF was available, so the megablast hits were blasted against a database of known B. malayi coding sequences, and matches were removed.

3.1 CNE counts

CNEs found using methodology and parameters from Vavouri et al.

Table 3.1 summarizes the numbers of CNEs found at each stage of the pipeline. The last column lists the number of CNEs found after all the steps proposed by Vavouri et al. (2007). The first step for finding CNEs, as described in Chapter 2, is to use megablast to find the conserved regions, and this table displays the CNEs found with word seed size 30 (-W 30) and an e-value threshold of 0.001 (-e 0.001), as described in Vavouri et al.

The sixth row in Table 3.1 (C. elegans–C. briggsae) acts as a confirmation that the programs written for this project did what they were supposed to do. The number of CNEs found for the pairwise comparison between C. elegans and C. briggsae corresponds well with Vavouri et al.'s finding of 3061 putative CNEs using the same method between the same pair of species. To further verify that the pipeline worked as intended, the set of elegans and briggsae CNEs found was blasted against the set of CNEs published as supplemental online material by Vavouri et al. (i.e., the program blastn was run with stringent settings and all the CNEs matched consistently).
The number of putative CNEs for that pair reported here is lower than their finding because more annotation information has become available for C. elegans since their study, more C. briggsae ESTs were available in Harte et al. (2004) to check against, and the genome sequences for C. elegans and C. briggsae have been updated.

It is immediately apparent from the last column in Table 3.1 that no CNEs were found at this sensitivity level (-W 30, -e 0.001) for two pairs (C. elegans–B. malayi and C. elegans–T. spiralis), practically none were found for three other pairs (B. malayi–T. spiralis, C. remanei–B. malayi, and C. remanei–T. spiralis), and very few were found for two other pairs (C. briggsae–B. malayi and C. briggsae–T. spiralis). The only pairs for which thousands of CNEs were found were the comparisons between the three Caenorhabditis species: elegans, briggsae, and remanei. These numbers are more in keeping with those expected on the basis of past studies.

The pairs for which no CNEs were found at all using Vavouri et al.'s methodology and their levels of sensitivity for detecting conserved regions were the ones between the best annotated species (C. elegans) and the species that diverged earliest from the others (B. malayi and T. spiralis). This finding may indicate that the few CNEs that were found in other pairwise comparisons with B. malayi or T. spiralis are the result of poorly annotated genomes, where some coding regions have not yet been removed as thoroughly as they were for C. elegans.

The big-bang hypothesis that this study set out to verify ("CNEs emerged once, several hundred million years ago") has thus been dealt a severe blow, because the very first look at pairwise CNEs seems to indicate that no or very few conserved non-coding elements are found when comparing nematode species that diverged more than 100 MYA. This is a strong claim and is examined in more detail in the Discussion chapter.

Just to be sure that the megablast program heuristic was not accidentally leaving some conserved regions out, all the CNEs found in pairwise comparisons between the three Caenorhabditis species were blasted against the B. malayi and T.
spiralis genomes (using the program blastn with word seed size 30 and an e-value threshold), and no hits were returned. The next two parts of this section report the CNEs found at less stringent levels of sensitivity, with some additional steps to ensure that all coding regions were identified as well as possible.

CNEs found after additional steps to remove coding regions

Three additional steps were performed to identify and remove coding regions in the conserved portions of pairwise comparisons. For details, see the section on Removing coding regions in the chapter on Methods and Materials. Performing the additional steps (removing putative CNEs that had hits at the e-value threshold in blastx searches against NemPep3 and UniRef90, and in a tblastx search against the coding regions of the well annotated C. elegans genome) gives Table 3.2. The CNE
