Processed Small RNAs in Archaea and BHB Elements

Size: px

Start display at page:

Download "Processed Small RNAs in Archaea and BHB Elements"

Jonas Farmer
5 years ago
Views:

1 University of Leipzig Bioinformatics Master s Program in Bioinformatics Master s Thesis Processed Small RNAs in Archaea and BHB Elements submitted by Sarah Berkemer March 16, 2015 Supervisor Prof. Peter F. Stadler Advisor Dr. Christian Höner zu Siederdissen Reviewers Prof. Peter F. Stadler Dr. Christian Höner zu Siederdissen

3 Acknowledgement I would like to thank Peter F. Stadler for supervising me with this work but also for the support, the motivating goals and for enhancing possibilities of distinct workplaces in the world. I also want to thank Christian Höner zu Siederdissen who advised, helped and supported me a lot, especially for reviewing in really short time and his (almost) 24/7 answering skype account open for many questions. Thanks to Petra Pregel and Jens Steuck for bureaucratic and technical support. A lot of the work was done at the TBI in Vienna, where I could spend two months. Thanks a lot to Ivo L. Hofacker who invited me to Vienna and the TBI table soccer team for good discussions and a great time. I want to thank Fabian Amman and Sebastian Will for support and help in topics I didn t know a lot about before and for patiently answering all my questions. I would like to thank my collegues and friends who supported, motivated and also distracted me when I needed a break. I also want to thank a lot Sven and my family being always available when needed, giving me mental and material support.

5 Abstract Bulge-helix-bulge (BHB) elements guide the enzymatic splicing machinery that in Archaea excises introns from trnas, rrnas from their primary precursor, and accounts for the assembly of piece-wise encoded trnas. This processing pathway renders the intronic sequences as circularized RNA species. Although archaeal transcriptomes harbor a large number of circular small RNAs, it remains unknown whether most or all of them are produced through BHB-dependent splicing. In this thesis the approach of two genome-wide surveys of BHB elements of a phylogenetically diverse set of archaeal species is shown and further complemented by searching for BHB-like structures in the vicinity of circularized transcripts. A result of this work shows that besides trna introns, the majority of box C/D snornas is associated with BHB elements. Not all circularized srnas, however, can be explained by BHB elements, suggesting that there is at least one other mechanism of RNA circularization at work in Archaea. Pattern search methods were unable, however, to identify common sequence and/or secondary structure features that could be characteristic for such a mechanism. The thesis is based on the work by Höner zu Siederdissen et al. [1] and its extension by Berkemer at al. [2]. The first study gives an overview of the topic, especially BHB elements and the application of programs like Infernal [3] and segemehl [4, 5] to find such secondary structure motifs. The second study refines the models such that the search can be done more specifically. Additionally, more data sets were included in the workflow to search and evaluate further results, in particular the inclusion of both, the previously considered intron-splicing BHB elements, and an additional class of box C/D snorna-splicing elements.

7 Contents 1. Introduction 1 2. Biological Background Archaea RNA Processing Splicing Mechanism RNAs in Archaea Splicing in Archaea Circularization Circularized introns Examples of circular RNA in Archaea Bulge-Helix-Bulge (BHB) elements Structure of BHB elements Location of BHB elements Computational Methods Mapping Covariance Models Pattern Search Workflow Known and Putative BHB Elements Circular transcripts Evolutionary Conservation Element-based search Pattern Discovery Results and Discussion Results Known box C/D snornas with BHB elements Circularized RNAs with BHB elements Circular RNAs without BHB elements Discussion Conclusion and Outlook 43 A. Appendix 45 I

8 Contents List of Figures 62 List of Tables 67 II

9 1. Introduction Since the discovery of Archaea in the 1970s and their definition by Woese et al. in 1990 [6], much has been done concerning sequencing of archaeal genomes and their analysis. Before, Archaea were classified into the kingdom of Prokaryotes together with Bacteria. However, the archaeal species share traits with Eukaryotes as well as with Bacteria. In this way, Archaea are now understood as one of the three domains of life, clearly distinguished from Eukarya and Bacteria. Even though they seem to be in between Eukarya and Bacteria when looking at their genomics and molecular traits, there exist theories which cluster Archaea together with Eukarya into one domain, supporting the theory that Eukarya seem to have evolved from archaeal ancestors [7]. The small non-coding RNAs (srnas) of Archaea are much less understood than their eubacterial counterparts. At least in part this is a consequence of the comparably much smaller number of fully sequenced genomes since the large evolutionary distances between them hamper homology-based annotation. Archaea share ribosomal RNAs (rrnas), transfer RNAs (trnas) and the RNA components of RNase P and the signal recognition particle (7S RNA) with the other two domains of life. Although a large number of other srnas has been described for individual species, only three RNA classes are unambiguously recognizable within this diversity: Archaea and Eubacteria share CRISPR-Cas adaptive immune systems [8], and two classes of guide RNAs direct chemical modifications of rrnas and other non-coding RNAs in both Eukarya and Archaea [9]. Both the archaeal box C/D and the box H/ACA RNPs contain core proteins clearly homologous to those of the eukaryotic snornps, reviewed in [10, 11]. Hence they are often referred to as snornas even though Archaea do not have a nucleolus. Thus, in this work the nomenclature is maintained. Box C/D srnps have a dimeric architecture [12] throughout the Archaea. Like their eukaryotic counterparts they catalyze 2 -O-ribose methylation at target sites determined by the box C/D snorna [9]. These RNAs have a length of about nucleotides and feature two functionally essential kink-turns associated with the C/D and C /D sequence motifs that are characteristic for the RNA class [13]. The box H/ACA snornas guide the formation of pseudouridine [9, 14]. Their canonical secondary structure consists of a single stem-loop structure. In contrast, eukaryotic snornas usually consist of two such stem-loops, each of which addresses an individual target. Beyond the canonical forms, recently several pseudouridylation guide RNAs with divergent secondary structures have been reported [15]. Circularized forms of small RNAs are abundant in some Archaea [16]. Some box C/D snornas seem to be present predominantly or possibly exclusively as circular RNAs [17, 16, 18]. This is also true for assorted other small RNA species, among them the 5S rrna [16]. The biosynthesis 1

10 1. Introduction of most of these circular RNAs remains unknown. understood as a consequence of enzymatic splicing. Only in a few select cases it is Archaea completely lack a spliceosomal splicing machinery. Enzymatic splicing, however, is a rather common feature in archaeal RNA processing. Mechanistically, it is closely related to trna splicing in Eukarya. A specific endonuclease recognizes and cleaves the so-called bulge-helix-bulge (BHB) structure, a specific ligase then joins the exons and circularizes the intron [19, 20, 21, 22]. The prime example of BHB splicing is the removal of trna introns. In contrast to Eukarya, archaeal trnas may have multiple introns [23, 24] and the same mechanism implements a form of transsplicing that composes trnas from two or three independently encoded fragments [25, 26, 27, 28, 29, 30]. The maturation of the ribosomal RNA operon proceeds with the help of two BHB elements that form over long distances and guide formation of circularized precursors of the 16S and 23S rrna, respectively [31, 32]. At least one pre-mrna (CBF5, the archaeal homolog of dyskerin) contains an intron with a BHB element in many crenarchaeal species [33]. Although many of the box C/D snor- NAs in several archaeal species appear as circularized RNA molecules in the cell [34], this is known to be the consequence of BHB-guided splicing in only a single case, namely the box C/D snorna processed from a long intron of the trna-trp precursor in Pyrococcus species [35]. The BHB elements of trna introns and rrnas are well-conserved across the Crenarchaeota phylum [36] and have been studied in detail [36, 37, 38, 39, 40, 41, 16, 34]. They fall into three classes originally defined in [36]. This thesis is based on two studies by Höner zu Siederdissen et al. [1] and Berkemer et al. [2] which explore mainly two interrelated questions: (1) Can BHB-elementdependent splicing explain all or at least most of the observed circularized or permuted small RNAs, and (2) to what extent can BHB elements be used in their own right as means of detecting novel small RNAs and/or likely sites of RNA processing. For the first question, the focus is based in particular on box C/D snornas because the members of this abundant RNA class are often circularized. The second question is of practical relevance since in many cases RNA-seq data do not provide decisive information. (a) Circularized RNAs are depleted in most RNA-seq protocols unless specifically enriched e.g. by RNase R treatment [16]. (b) Spliced trnas cannot be detected in cases where an unspliced paralog is present in the same genome [18]. (3) Circularized introns e.g. of trnas are often too short to be detectable by sequencing in their own right. For instance 5 out of 8 introns listed in [34] have a length of 26 nt or less. This work consists of 6 chapters separated in Introduction 1, Biological Background 2, Computational Methods 3, Workflow 4, Results 5 and Conclusion 6. The following chapter, Chap. 2, describes the biological background focused on kinds of RNA and splicing mechanism in Archaea compared to the other two domains, namely Eukarya and Bacteria. Furthermore, the structure and location of BHB elements serving as a recognition motif for enzymatic splicing in Archaea is explained. Chapter 3 then presents the computational methods used to analyze the given data. 2

11 Thus, the programs are explained and reasons are given for having chosen them. In the next chapter, Chap. 4, the steps of the analysis are enumerated and sketched, including analyzing the data, building the models and evaluating the results. This workflow includes the data which was used, the way the data was analyzed and a description of how the programs were applied to the data. Furthermore, the chapter describes the search for homologous sequences and further sequence or secondary structure patterns. In Chapter 5, results of the analysis are shown and explained. The results are separated into two parts, namely a subset of sequences showing evidence for a flanking BHB element, and sequences were no BHB element was found. The latter sequences were analyzed further and results are shown. Additionally, a short discussion about the results is included in Chap. 5. The last chapter, Chap. 6 gives a conclusion about the work and an outlook about future work. The material and results of this thesis are summarized in two publications. The first paper titled Comparative detection of processed small RNAs in Archaea. was written in 2014 Christian Höner zu Siederdissen, Sarah Berkemer, Fabian Amman, Axel Wintsche, Sebastian Will, Sonja J. Prohaska and Peter F. Stadler and is published in the proceedings of the IWBBio 2014 (International Work-Conference on Bioinformatics and Biomedical Engineering). The second publication titled Processed Small RNAs in Archaea and BHB Elements. was written in 2015 by Sarah J. Berkemer, Christian Höner zu Siederdissen, Fabian Amman, Axel Wintsche, Sebastian Will, Ivo L. Hofacker, Sonja J. Prohaska and Peter F. Stadler and is submitted to Genomics and Computational Biology. 3

13 2. Biological Background This chapter gives an introduction to the domain Archaea in general, unique traits found in archaeal species but also similarities in comparison with the other two domains, Bacteria and Eukarya [6]. Furthermore, RNA processing, especially splicing mechanisms, are explained. By now, the archaeal splicing mechanism is thought to always produce a circularized intron. There are different types of RNA molecules which undergo the splicing procedure. Even though circularized RNA molecules are found rarely in nature, they seem to be abundant in Archaea [42]. The second part of this chapter focuses on BHB elements, also called bulge-helix-bulge elements, which are a secondary structure element closely related to the archaeal splicing mechanism. Thus, pre-rna molecules form this bulge-helix-bulge motif which acts as a recognition element for the splicing enzymes [36]. By now, no typical sequence motifs related to BHB elements are found and it is known, that BHB elements are quite flexible in structure and length which makes it more complicated to search and locate BHB elements Archaea Since the discovery of the archaeal species in the 1970s [6], numerous studies contributed to gain a more detailed and deeper understanding of the archaeal domain and their role during the evolution of species. Originally, Archaea were thought to belong to the kingdom of Prokaryotes. Based on protein sequences and structural characteristics, Woese et al. proposed the three domains of life [6]. Additionally, first archaeal species were found and classified into the evolutionary tree of life as it can be seen in Figure 2.1. Archaea share characteristics with Bacteria as well as with Eukarya but one can find several traits which are unique for the archaeal species as well. Table 2.1 gives short definitions for the three domains of life as defined by Woese et al. [6]. However, their classification might change again based on current insights gained through sequencing technologies and computational methods which show evidence for a classification of Archaea and Eukaryotes into one domain [7]. Common Traits of Archaea and Bacteria Single cellular organisms without a nucleus are called prokaryotic. Before distinguishing Bacteria and Archaea based on their genetics, they were classified together into the kingdom of Prokaryotes regarding their phenotypes [6]. Another common trait of Archaea and Bacteria is the shape of their genome. Archaeal and bacterial genomes, apart from a few exceptions, consist of one chromosome which is circularized inside the 5

14 2. Biological Background Figure 2.1.: Tree of life (a) as developed by Woese et al. [6] when defining the three domains of life, Bacteria, Archaea and Eukarya and evolutionary relations between them. Panel b shows how organisms evolved. Arrows indicate the direction of evolution whereas for Archaea and Gram-positive Bacteria a coevolution is assumed, indicated by the double-headed arrow. The dashed lines show putative symbiotic events of archaeal and bacterial cells which led to eukaryotic cells. Monoderms and Diderms indicate the number of membranes, thus one or two which bound the prokaryotic cells. Here, CM stands for cell membrane, CW for cell wall, OM for outer membrane and PE for periplasm. The figure is taken from Gupta [43]. cell soma [6, 44]. The genomes lengths and the number of protein-coding genes vary between the species but Bacteria and Archaea both seem to have their genes arranged into operons [44], regulating gene transcription and expression. As well as Bacteria, Archaea express so called CRISPR RNAs which act as an immune system in prokaryotic organisms being a protection against foreign plasmids or phages [45]. CRISPR RNAs are composed of repetitive sequences with short spacer sequences in between. They are often associated to cas proteins which are related to CRISPR RNAs, together called the CRISPR/Cas system. Common Traits of Archaea and Eukarya Even though single cellular organisms can also be found within the domain of Eukarya, a nucleus can be found in nearly all the eukaryotic species whereas archaeal species lack a nucleus. Still, the translation and transcription processes in Eukarya and Archaea are similar. An exception are archaeal RNA-polymerase and ribosomes which differ from the eukaryotic ones but still resemble more their eukaryotic relatives than the ones in Bacteria [46]. Beside several genetic traits which are common or very similar between Archaea and Eukarya, it has been discovered that both, archaeal and 6

15 2.1. Archaea Domain Eucarya [Greek adjective ɛˆυ for good or true and Greek noun καρυoν for nut or kernel, referring to the nucleus]: cells eukaryotic; cell membrane lipids predominantly glycerol fatty acyl diesters; ribosomes containing an eukaryotic type of rrna. Domain Bacteria [Greek noun βακτηριoν for small rod or staff ]: cells prokaryotic; membrane lipids predominantly diacyl glycerol diesters; ribosomes containing a (eu)bacterial type of rrna. Domain Archaea [Greek adjective ˆαρχαιoς for ancient, primitive]: cells prokaryotic; membrane lipids predominantly isoprenoid glycerol diethers or diglycerol tetraethers; ribosomes containing an archaeal type of rrna. Kingdom Euryarchaeota (Archaea) [Greek adjective ɛˆυρυς for broad, wide, spacious; Greek adjective ˆαρχαιoς for ancient, primitive]: relatively broad spectrum of niches occupied by these organisms and their varied patterns of metabolism; ribosomes containing an euryarchaeal type of rrna. Kingdom Crenarchaeota (Archaea) [Greek noun κρηνη for spring, fount; ˆαρχαιoς for ancient, primitive]: ostensible resemblance of this phenotype to the ancestor (source) of the domain Archaea; ribosomes containing a crenarchaeal type of rrna. Table 2.1.: Definitions for the three domains of life and two archaeal kingdoms by Woese et al. in 1990 [6]. eukaryotic DNA is associated with histones. Archaeal species just use the core part of the histone complex whereas Eukaryotes have two additional subunits bound to the outer part of the DNA-histone complex [44]. Another important common trait among Eukarya and Archaea is the existence of snornas (small nucleolar RNAs) [48], a group of ncrnas which are short in sequence length and guide RNA modifications. Two families of snornas, as they are called in Eukaryotes, are box C/D and box H/ACA snornas. They guide RNA methylations and other RNA modifications and are characterized by their sequence motifs. Box C/D snornas show conserved motifs named boxes C (5 PuUGAUGA3 ) and D (5 CUGA3 ) which fold into two kink-turn structural motifs. They also contain a less conserved copy of the motifs which are called boxes C and D. Boxes C and D can be found near the start and end of the sequences whereas boxes C and D form the central part of the sequence. Box H/ACA snornas consist of conserved motifs called box H (ANANNA, where N stands for any nucleotide) and box ACA (ACA, always found 3 nucleotides away from the 3 end). Box C/D and H/ACA snornas also contain antisense elements or guide regions which serve as the binding region to guide RNA modifications, as depicted in Figure 2.2. Target molecules can be 16s and 23s rrna but also other types of RNAs. Especially rrnas are in need for RNA modifications guided by small RNAs to ensure their stability [49]. 7

16 2. Biological Background Figure 2.2.: Schematic secondary structure (a) of a box C/D snorna showing the locations of box motifs whereas box C (UGAUGA) and box D (CNGA) are highly conserved and box C and box D are less conserved. Between the box motifs guide regions are located used to bind rrna molecules for modification. A box H/ACA snorna (b) has conserved box H (ANANNA) and box ACA (ACA) motifs and guide regions to bind rrna molecules (taken from [47]). Unique traits of Archaea (and close relatives) Regarding the sequences of the 16s rrnas of diverse prokaryotic species, Woese et al. discovered that the 16s rrna found in a subgroup of the species differed significantly from the remaining data set [50]. Based on this discovery, Woese could define the three domains of life which consist of Archaea, Bacteria and Eukarya [6] and replaced the theory of the two kingdoms, the Prokaryotes and Eukaryotes. The 16s rrna is a ribosomal RNA used for protein production. It occurs in prokaryotic organisms and supports the production of several vital proteins. Therefore, a mutation in the 16s rrna is mostly lethal. Still, Woese et al. found different kinds of 16s rrna inside the prokaryotic kingdom which led to the formation of two different domains of life. Another trait distinguishing Archaea from Bacteria and Eukaryotes is the structure of their membranes. Since many archaeal species can be found in extreme environments, the molecular structure of archaeal cell membranes differs from other species due to stability reasons. As depicted in Figure 2.3, archaeal membranes include Ether bonds linking the head of the lipids to the tail. For Bacteria and Eukarya, Ester bonds fulfill this function [51]. Additionally, in some archaeal as well as bacterial species the membranes consist of just a single lipid layer instead of a double lipid layer. In lipid bilayers, the membrane 8

17 2.1. Archaea Figure 2.3.: Membrane lipids of Archaea and Bacteria [51]. The main difference between the lipids is the kind of bonds linking the backbone with the fatty acid to form a membrane lipid. In Bacteria as well as Eukarya, this is done by Ester bonds whereas in Archaea, Ether bonds are used due to stability reasons since many archaeal species can be found in extreme environments. consists of two phospholipid molecules which are mostly subdivided into an hydrophilic head and a hydrophobic tail. The membrane is formed in a way such that the tails of the two molecules point into the inside of the membrane, the heads form the outer edges. In this way the inner part is protected from water whereas the heads are formed to interact with water molecules. In many archaeal and several bacterial species such as Gram-positive Bacteria, the membrane consists just of one lipid layer. The phospholipids consist of one inner fatty acid chain which is hydrophobic. The inner part of the molecule is protected by two hydrophilic heads at each side of the fatty acid chain. In this way, the membrane still has the property of being able to interact with water molecules at the outer edge but is more stable in case of extreme conditions [51]. This common trait between several archaeal species and Gram-positive Bacteria led to the assumption of a close evolutionary relationship among them. There are a number of competing theories since most archaeal species are immune against antibiotics produced mainly by Gram-positive Bacteria [43]. This would support the archaeal ancestors evolving from gram-positive bacterial species due to selective pressure and adaption to extreme environments, a theory which is supported by the fact that many archaeal species can be found in environments being very hot or cold, sulfurous, highly saline, acidic, or alkaline water. Extremeophile Archaea belong to one of four main physiological groups: halophiles, thermophiles, alkaliphiles, and acidophiles [52]. Still, there exist also some archaeal species living in mild or 9

18 2. Biological Background moderate conditions [53]. Being able to adapt to extreme or unusual living conditions - such as hot or acidic environments but also adapting to neighbours with an affection to antibiotics - can also be seen when looking at their metabolism. Some archaeal species are able to perform methanogenesis, thus being able to produce methane [44]. Figure 2.4.: Different hypotheses for the tree of life [7]. The three-domains theory (a) was first suggested by Woese et al. [6]. In the Eocyte hypothesis (b), Eukarya are classified within the Archaea which leads again to just 2 domains of life [7] however, distinct from the Prokaryotes-Eukaryotes classification. Additionally, the species of the TACK superphylum classified as closely related species are depicted here. The TACK species play an important role regarding distinct evolutionary hypothesis. Depending on the hypothesis, TACK species are close relatives to Eukaryotes or Euryarchaeota. Evolutionary Relationships Since the discovery of Archaea by Woese et al. [6], Archaea are assumed to play an important role in the evolutionary development of species. With the detection of more and more archaeal species and their genetic traits, their evolutionary relations to Bacteria and Eukarya get more and more complex. In recent studies, a new superphylum called TACK was proposed by Guy et al. [54]. This superphylum includes 4 archaeal species, namely Thaumarchaeota, Aigarchaeota, Crenarchaeota and Korarchaeota (for definitions, see Table 2.2), which are closely related to each other but on a different branch in the phylogenetic tree than the Euryarchaeota. In fact, the Euryarchaeota are the focus of another discussion. With the definitions of new phyla among the archaeal domain and new sequencing technologies, new archaeal species can be found and classified into the phylogenetic tree [55] as depicted in Fig In this work, small RNAs in five different archaeal species will be analyzed, namely Methanopyrus kandleri, Sulfolobus acidocaldarius, Sulfolobus solfataricus, Nanoarchaeum equitans and Ignicoccus hospitalis. As it can be seen in the phylogenetic tree, Fig. 2.5, M. kandleri and N. equitans are classified to the Euryarchaeota, whereas M. kandleri 10

19 2.1. Archaea Thaumarchaeota: Reside in both aquatic and terrestrial environments; recently proposed phylum of abundant chemolithoautotrophic ammoniaoxidizers that play an important role in biogeochemical cycles such as the nitrogen cycle and the carbon cycle. Aigarchaeota: species of the Hot Water Crenarchaeotic Group I (HWCGI); one single species identified by now (Ca. Caldiarchaeum subterraneum), which was identified in a microbial mat at a geothermal stream of a subsurface goldmine; its genome contains some intriguing eukaryotic features Crenarchaeota: well-characterized archaeal phylum first described by Woese [6]; comprising mostly (acido)thermophilic anaerobes, although some aerobic and micro-aerophilic species exist; Crenarchaeal cells come in various shapes, ranging from coccoid to rod-shaped/filamentous, and they mostly rely on respiration for generation energy Korarchaeota: a group of deepbranching Archaea with ultra-thin, needleshaped small cells; geographically restricted to terrestrial and marine thermal environments Table 2.2.: Summary of the definitions of the TACK superphylum by Guy et al. [54] belongs to the group of Methanopyrales and N. equitans to the Nanoarchaeota phylum. In vivo, N. equitans lives in a symbiotic manner with I. hospitalis such that collecting genomic data of one always includes also the second species. I. hospitalis as well as S. solfataricus and S. acidocaldarius belong to the Crenarchaeota, one of the phyla inside the TACK superphylum. Thus now, there are also new possibilities for redefining the relations between the three domains of life. One recently published theory is the theory of the 2 domains of life [7, 56], since here, Eukaryotes are classified within the domain of Archaea, such that Eukaryotes are closely related to the species of the TACK superphylum. The Euryarchaeota are then the next relative to the group of Eukaryotes and TACK species. This can be seen in Figure 2.4. This theory is based on the assumptions that current phylogenetic methods are more accurate than the ones Woese et al. used in 1990 and before [6]. At the same time reasonable evolutionary causes set the focus on the 2 domains theory [7, 56] even though none of the theories was shown to be significantly more reasonable. 11

20 2. Biological Background Figure 2.5.: Phylogenetic tree of life for archaeal species in 2011 (taken from [55]) 12

21 2.2. RNA Processing 2.2. RNA Processing RNA processing seems to be an essential task in all known organisms. Even Bacteria process their RNAs, but they seem to have less RNA processing in total in comparison to Archaea and Eukarya [48]. A lot of different pathways and targets for RNA processing have been identified. RNA processing includes synthesizing and digesting of RNA molecules but also their modification. Such modifications are done to stabilize the RNA molecule and protect it from digestion. This can be done by adding short sequences at its ends as done with mrna molecules. But also circularizing the molecule will lead to a more stable structure. Another important task in RNA modification is the splicing procedure. Here, one or more parts of the molecule are cut out (introns) and the remaining subsequences (exons) are ligated together such that they form again a single RNA sequence. In this way the exon as well as the intron can further act as an RNA molecule [48]. Known ways of splicing and the splicing mechanism in Archaea will be explained in the following subsections. 1. Self-splicing (autocatalytic) group I and II introns Splicing is done due to a transposition mechanism by RNA molecules without further help of proteins. The mechanisms are similar and the transposition is called homing for group I introns or retrohoming for group II introns since here, an RNA intermediate is involved instead of just DNA molecules. Self-splicing introns act as ribozymes (RNA with enzymatic activity). The mobile elements can be found in Eukaryotes and Bacteria, as well as viruses whereas group II introns do not occur in eukaryotic nuclear genomes [57]. Self-splicing introns do not exist in Archaea except a few cases of group II introns reported in Euryarchaeota [58]. 2. Spliceosomal introns The spliceosome is a highly flexible and dynamic ribonucleoprotein (RNP) complex composed of five snrnps and several proteins. Eukaryotic pre-mrna splicing is done with the spliceosome [59]. 3. trna and archaeal introns (enzymatic splicing) trna splicing exists in all domains of life. Whereas in Archaea and Eukarya, trnas are spliced by enzymes [60], bacterial trnas are self-splicing [58]. Archaeal and eukaryal trna endonucleases are closely related to each other [58]. At least a second enzyme, a specific ligase is also included in the trna splicing mechanism. In Archaea, not only trnas are spliced due to this mechanism but also ncrnas and mrnas are known to undergo enzymatic splicing [49]. For Eukaryotes, there exist a few cases, where protein-coding genes are spliced enzymatically [61, 62]. Table 2.3.: Summary of known splicing pathways. A classification of splicing mechanism into the domains of life is shown in Fig

22 2. Biological Background Type Eukarya Bacteria Archaea Group I y y n Group II y y y* Splicesosomal y n n Enzymatic y n y Table 2.4.: Splicing mechanism in the domains of life. In Eukarya, all splicing mechanisms can be found (y). In Bacteria, just self-splicing is possible. Only enzymatic splicing is done in Archaea, except for a few species where self-splicing introns of group II were reported(*) [58] Splicing Mechanism Four different splicing mechanisms are known today. Whereas all mechanisms exist in Eukarya, Bacteria and Archaea are known to use just two of them. As shown in Tab. 2.3 and 2.4, only self-splicing introns (group I and II) can be found in Bacteria [57]. In Archaea, mainly enzymatic splicing is done except in a few cases where selfsplicing group II introns were reported [58]. For Eukarya, most mrnas are spliced by the spliceosome, mainly noncoding RNAs are self-splicing and trnas underlie the enzymatic splicing mechanism. For all cases, a small number of exceptions are known [61, 62]. In contrast to Eukarya where enzymatic splicing is mainly used for trnas, Archaea are known to also splice ncrnas, rrnas and mrnas enzymatically RNAs in Archaea Even though a plethora of different RNA molecules exist, just the ones which are important for this work will be described here. As RNAs are an essential part in the molecular genetics of organisms, they also play an important role in phylogenetic studies. As described in Section 2.1, Archaea share common traits with Eukarya as well as Bacteria. Besides mrna, trna and rrna which are part of the transcription and translation processes, snornas and CRISPR RNAs were found in archaeal species. Box C/D and H/ACA snornas are two families inside the class of snornas and can also be found in Eukarya. So called antisense elements inside the snorna sequence are responsible for the binding of box C/D or H/ACA snornas to their target molecules which are mainly other RNAs. Box H/ACA snornas are barely found in Archaea [49]. However, box C/D snornas are abundant in most of the archaeal species, e.g. in Methanopyrus kandleri 126 different C/D box snornas were found, in Nanoarchaeum equitans which has a small but dense genome, still 22 C/D box snornas could be detected [63]. By now, it is known that many C/D box snornas are often located in intergenic regions, sometimes overlapping the 3 or 5 flanking ORFs [17]. Another important ncrna in Archaea is the CRISPR RNA. This kind of RNA molecule is part of the prokaryotic immune system and can be found in prokaryotic organisms (see also Section 2.1). 14

23 2.2. RNA Processing A B C Figure 2.6.: BHB element scheme. Splicing occurs at the splicing junctions (indicated by arrows) on substrate A. The splicing junctions are located within two bulges linked by a helix forming the BHB motif. After the splicing process exon B and intron C are ligated such that the intron is circularized Splicing in Archaea Since at first Bacteria and Archaea were clustered together in the kingdom of Prokaryotes, they were assumed to use similar splicing mechanisms such as self-splicing. By sequencing and analyzing the 16s RNA in Archaea and Bacteria, Woese et al. could show that Archaea are an own domain of life [6]. Subsequently, traits being distinct among Archaea and Bacteria were discovered, such as the splicing mechanisms [64]. As explained in Table 2.3, four different pathways for the splicing mechanism are known. In Eukaryotes, each of those mechanisms can be found. In Archaea just one splicing mechanism, enzymatic splicing, is known today except in a few cases of group II introns [58]. This mechanism is based on two enzymes, the specific endonuclease and a ligase. In known cases of splicing in Archaea, the exon as well as the intron is ligated after the splicing. This leads to circularized introns which seem to be abundant in archaeal species [21]. This machinery is assumed to be responsible for the splicing of different kinds of RNAs e.g. the maturation of mrna, trna and rrna which also includes 16s and 23s rrna [19, 20, 32]. It could be shown that the endonuclease uses a certain recognition element which seems to be expressed only as a secondary structure element instead of showing a typical sequence motif [36, 41, 1]. The so called BHB motif, meaning bulge-helix-bulge motif, is located in the region around the splicing sites, as depicted in Figure 2.6. The following subsection will deal with the subject of circularized introns and BHB motifs. Another mechanism related to splicing which exists in at least several archaeal species is the so called trans-splicing mechanism. Here, two or more RNA molecules first ligate to each other before the splicing mechanism is executed, as shown in Figure 2.7. In this way, different RNA-halves can be joined to form several combinations and so lead 15

24 2. Biological Background Figure 2.7.: Cis-splicing is the process where an intron is spliced out of an RNA molecule formed by a single continuous RNA sequence. For trans-splicing, two or more RNA (sub)sequences are ligated to form a further distinct RNA structure. Then the intron is spliced out such that the it contains subsequences of former distinct RNA sequences [25]. to diverse RNA molecules. This seems to be a common practice for organisms with a relative small genome and is mostly studied for trna molecules, also called split transfer RNA [25, 26, 27, 28, 29, 40]. In total, most of the archaeal splicing mechanism was discovered based on the splicing of trna molecules. A good summary and comparison between archaeal and eukaryotic can be found in the work by Yoshihisa in 2014 [42] Circularization Circularized RNAs can be found in all domains, including Eukarya, Bacteria, Archaea and Viruses. For the majority of circularized RNAs it is assumed that they are functional even if their function was not found yet [65]. However, circularized RNAs seem to occur very rarely in nature [16]. It is known that many circularized RNAs are a product of the splicing procedure, but appear also as genomes (viruses) or permuted trnas [40]. In Archaea it was thought that circularized RNA molecules are a product of every splicing process which is caused by the BHB element as a recognition element in the secondary structure of the pre-rna [36]. But in recent studies, circularized RNA molecules without evidence for a BHB element were found [1]. So BHB dependent splicing is not the only splicing mechanism existent in Archaea Circularized introns Even though circularization of RNA molecules also occurs in viral genomes and other contexts, this work focuses on circularized RNA molecules which are product of a splicing procedure. In Archaea it is thought that all splicing mechanisms have a circularized intron as an output since this increases stability, being essential when living in extreme environments where most of the archaeal species can be found in (see Subsection 2.1). Not much is known about the splicing process and the ligation thereafter in Archaea. In Brooks et al. the structure for a DNA and a putative RNA ligase was revealed [66]. Here, they could not detect the substrate of the ligase but it seems to be able to also 16

25 2.3. Circularization ligate double stranded structure, which occurs in secondary structure conformations of RNA molecules. Circularized introns can be found in RNA-Seq data based on the splicing sites and reads that do not map linearly on the reference genome. In Chapter 3, finding circularized introns is explained in more detail. Figure 2.8.: Circularization mechanism for ribosomal RNAs in two archaeal species (taken from [67]). Bold lines indicate the mature rrnas and thin lines indicate precursor sequences. Red arrows indicate the splice sites inside the bulges of the BHB elements. After the BHB-dependent splicing process the intron and the exon are further processed, indicated by purple arrows Examples of circular RNA in Archaea Based on numerous examples, it is assumed that most of the splicing processes in Archaea result in a circularized intron. For many cases, the functions for either the exon or the intron is known. In only a single case, namely the box C/D snorna processed from a long intron of the trna-trp precursor in Pyrococcus species [35] the function of the exon as well as the intron is known. For other archaeal box C/D snornas, it is known that most of them are circularized and are located in an intronic region [68]. Typical examples for circularized introns in Archaea are trna introns since pre-trnas undergo a splicing mechanism where at least one but up to three introns are spliced out. Then the ends of the exon, thus the trna molecule, and the ends of the intron are ligated together such that the intron is circularized and the exon can fold into the typical cloverleaf structure of a trna [36, 38, 39, 40, 41, 16, 69]. Figure 2.8 shows the splicing procedure of ribosomal RNAs in two archaeal species. 17

26 2. Biological Background Here, the splicing and circularization of the pre-rna is one step in the maturation process of the ribosomal RNA. Both introns are spliced in a BHB-dependent manner and result in circularized introns containing the RNA sequence. A further splicing step reveals the mature rrna molecules [67] Bulge-Helix-Bulge (BHB) elements The Bulge-Helix-Bulge (BHB) element is mainly formed by its secondary structure [36]. As of now, no common sequence motifs in sequences with a BHB element could be found yet [1]. Recent studies showed that most trna sequences having a circularized intron are also spliced based on the BHB dependent mechanism [28, 42, 49]. Even though, regarding certain splicing events, no BHB elements have been found which leads to the proposal for a second splicing mechanism being expressed in Archaea [1]. Figure 2.9.: Different trnas with BHB elements in S. tokodaii (taken from [36]). The pretrna-lys (A) forms an hbhbh motif at the canonical position. The splicing sites are indicated by the black arrows inside the bulges. After splicing, a 25-nt intron and the mature trna can be found. For cases B and C, splicing also occurs at the canonical position but the BHB element is expressed only as hbh element. Black arrows indicate the splice sites in the bulges whereas the bulge next to the anticodon lies inside the hairpin loop and thus, cannot be detected by looking at the molecule s secondary structure annotation. The anticodon is indicated by a frame around its three nucleotides. 18

2.4. Bulge-Helix-Bulge (BHB) elements 2.4.1. Structure of BHB elements As depicted in Figure 2.6, the BHB motif consists of two bulges enclosing a short helix.

27 2.4. Bulge-Helix-Bulge (BHB) elements Structure of BHB elements As depicted in Figure 2.6, the BHB motif consists of two bulges enclosing a short helix. The splice junctions between the exon and the intron are inside the bulges. After the splicing process, both ends of the intron and the exon are ligated again such that, in this case, a trna molecule and a circularized intron are formed [16]. Marck and Grosjean give a more detailed view on the BHB element [36]. Here, as shown in Figure 2.9, it can be seen that in many cases two more helices are formed. One of the helices lies in the exon, the other one can be found lying entirely in the intron. Due to this reason, Marck and Grosjean proposed to rename the BHB element and call it hbhbh element whereas h and h are the outer helices in intron and exon, B are the bulges and H is the main helix between the bulges [36]. However, in their paper, Marck and Grosjean recognize that there exist BHB elements which miss one of the outer helices such that elements as hbhb or BHBh are formed [36]. Furthermore, the secondary structure part which belongs to a BHB element is short and very flexible in sequence lengths. It is assumed that the bulges mostly have a length of 3 to 4 nucleotides but the helices lengths may vary between 3 to 10 nucleotides [1]. Recently, BHB elements with just one bulge, so either HBh or hbh, were found, which increases their flexibility even more [28]. Figure 2.10.: Comparison of intron positions in archaeal and eukaryotic trnas (taken from [42]). The colors encode the positions in certain groups of organisms whereas the canonical position can be found in most of the archaeal and eukaryotic species. For some cases, several introns can be found in a single pre-trna, including the canonical position and one to three introns at non-canonical positions. 19

28 2. Biological Background Location of BHB elements The location of BHB elements and thus, the splicing sites in archaeal genomes are mainly known for trnas. Even though it is known that splicing and BHB elements exist in other kinds of archaeal RNA [1] more effort has to be invested in characterizing these splicing sites as well as intron and exon structures. At the same time, it was proposed that there might be another recognition element or responsible mechanism for the splicing procedure than the BHB element [1]. It is probable that all of the trna molecules are spliced based on the BHB dependent mechanism [28, 42, 49, 41]. The trna itself is encoded in the exon. The intron is spliced out but nothing is known about its function yet. Just in one case it was announced that a box C/D snorna is encoded in the trna intron [42]. Also in Eukaryotes, trna splicing is known. Here, most of the introns are spliced out at the canonical position, meaning between nucleotide 37 and 38 of the trna sequence. Even though, other non-canonical splicing positions were found. In Archaea, splicing at canonical and non-canonical positions exist, but a lot more splicing events at noncanonical positions were discovered. This is depicted in Figure Splicing events at the canonical position favor the full BHB element, thus the element expressed by hbhbh [42, 28]. At non-canonical positions all variations of BHB elements are found, including half BHB elements, thus shapes as hbh or HBh [42, 28]. Further studies about trnas showed that there exist trna molecules with more than one intron [39] and split trna or permuted trna [40] or even combinations thereof [42]. Split trnas are those trnas participating in trans splicing events. Here, two or more trna-halves are joined together and then spliced. This process was first discovered in Nanoarchaeum equitans [25]. Permuted trnas which exist with or without intron form a short loop bridging the 3 and 5 ends and in this way, the 3 half is positioned upstream of the 5 half [42]. All these types of trnas are proven to form mature and functional trnas in archaeal organisms. Regarding the box C/D snornas in Archaea, it is known, that these RNAs are encoded inside the intron. In this case, the function of the exon is not known [42]. C/D box snornas are found in intergenic regions in the archaeal genomes, sometimes at 3 or 5 ends [17]. These RNA molecules are known to be circularized in the organism [34] and several show evidence for a BHB secondary structure element being responsible for the splicing process [1]. 20

29 3. Computational Methods BHB elements serve as recognition elements for enzymatic splicing. Methods to find known and new splicing sites flanked by BHB elements include mapping and analyzing of RNA-Seq data using segemehl [4, 5] as well as building covariance models with Infernal [3]. In this chapter the methods and programs which were used to search for putative BHB elements are described. Figure 3.1.: The order of start and end position of the circularized intron will differ in the reads and in the reference genome (taken from [16]). A read mapping in a non-linear way to the reference genome is an indication for a circularized RNA Mapping Recent technologies such as RNA sequencing protocols make it possible to sequence the transcriptome of a cell in a high-throughput manner. As the sequenced transcriptome will include numerous different kinds of RNA molecules, analyzing RNA-Seq data will reveal new insights into the transcriptome [18]. During the splicing process a pre-rna molecule is cut into at least two distinct RNA 21

30 3. Computational Methods molecules. These can also be detected in the transcriptome. Thus, the data also includes numerous circularized introns. Further RNase P treatment of the sample will digest all linear RNA molecules and so enrich the relative amount of circularized RNAs [16]. The reads are then sequenced. The resulting data set is analyzed with mapping protocols which assemble the reads or directly map them onto a reference genome. Here, segemehl [4, 5] was used as the mapping program since it provides functionality to find split reads using the --split option when running the program. The search for split reads is complicated by the fact that circular RNAs are sequenced in a way that the read will have a start and an end point which is probably different from the start and end points of the sequence region in the reference genome, namely the splice junctions, as depicted in Fig Additionally, the circularized RNA is not expressed in one single read but can be rejoined using several reads. In this way, circularized RNAs can be found by comparing the reads to the reference genome. If the assembled reads do not map linearly to the reference genome, but map in a non linear way, then this might be a split read belonging to a circularized RNA molecule [16], Fig After the mapping a remapping step follows using the program lack [70]. The remapping is done based on the splicing sites. In this way, multiple splicing sites can be found and previously unmapped reads might be mapped afterwards increasing the total number of mapped reads. Figure 3.2.: Reads mapping onto a circularized introns can be detected if the mapping can be done in a non linear way (taken from [16]). If several reads can be mapped in this way, a splice junctions is assumed to be flanked by the reads. A third program of the segemehl suite transrealign [4, 5] is used to extract the split reads including the length of the intron, information about circularization or multiple splicing sites and putative errors. To detect splice junctions, pairs of reads which probably flank a splice junction are connected internally by a link. The links are labeled by the corresponding read coverage. Due to multiple splicing, repeats or errors, a single read can be connected to several other reads. This is shown in Fig. 22

31 3.1. Mapping chr Pos A Pos B splits:x:y:z:c:p 0 + Table 3.1.: Reads A and B flank a splice junction on chromosome chr, whereas Pos A is the rightmost position in A and Pos B the leftmost position in B being the start and end position of the splice junction. The variables x, y and z stand for read coverages. Here, transrealign outputs a number x of reads covering the splice junction in total, so both splicing sites. The number y of reads is the coverage of just the left splicing site, z of just the right splicing site. After the numbers for read coverages, two variables are reported which indicate if a splice junction is a circularized intron (C) or a normal one (N). The letter P stands for passed. It is possible that one read can participate in several splice junctions with distinct partners. Here, either a multiple (M) splice junction is reported instead of a circular or normal one or in case of too many different read pairs, no clear splice junctions can be found. Then, the letter F for failed is reported instead of the letter P. Fig. 3.3 shows several possible splice junctions between distinct pairs of reads. The number 0 usually indicates a score, which is not used here. The sign + indicates the strand where the splice junction is located Assume, reads A and B flank a splice junction. However, the program also predicts reads A and B, A and B as well as A and B as potential read pairs flanking further splice junctions. The most probable splice junction can be found by taking into account the numbers x, y and z which indicate the read coverage for the subsequences. Depending on their relation and total values the most probable splice junction can be detected. The output pattern of transrealign for each splice junction and a more detailed explanation is given in Fig 3.1. A' A y' y x A'B x x x AB' AB'' z z' z'' B B' B'' Figure 3.3.: Concept of segemehl and transrealign when looking at the splice junctions. A splice junction between A and B is reported (bold lines). Thus, the numbers x, y and z indicating the coverage of both splice junctions, just the left one and just the right one, are included in the output. The numbers x, y ad z possibly differ strongly, depending on how many distinct read pairs flanking splice junctions were found (depicted by dashed lines). 23

32 3. Computational Methods consensus structure: guide tree: 2 3 BEGL 5 4 MATP MATP 7 13 MATR MATP MATL 10 MATL 11 MATL 12 MATL 13 END 14 ROOT 1 MATL 2 MATL 3 BIF 4 BEGR MATL MATP MATP MATL MATP MATL MATL MATL 23 END 24 Figure 3.4.: Consensus secondary structure used to build a covariance model. The guide tree is build based on the consensus secondary structure and forms the underlying data structure for the covariance model (taken from [71]). The nodes in the tree stand for different states. MATP stands for pair, MATL and MATR for the single sequence on the left or right side, BIF means bifurcation, ROOT indicates the root node of the tree, BEGL and BEGR show the beginning of the secondary structure after a bifurcation node on the left or the right side and END indicates a terminal node in the grammar, thus the end of the subsequence parsed in this branch of the guide tree Covariance Models BHB elements are quite flexible in length and distribution of the secondary structure elements. Overall however, their length is limited to a relatively small number of nucleotides and base pairs. Additionally, no sequence motifs were found meaning that only the secondary structure is conserved. However, to find homologous sequences, searching for secondary structure and sequence similarities at the same time seems to be most efficient. Regarding the flexibility of BHB elements, the program Infernal [71] was chosen for their identification. Infernal is based on covariance models (CM), a special case of stochastic context-free grammars (SCFG). Context-free grammars (CFG) are able to recognize palindrome-like structures such as RNA secondary structures. In this way, a CFG can evaluate if a secondary structures for an RNA sequence is valid. In the stochastic CFG, state transitions are used to model how probable certain transitions are and emission probabilities model the possible emissions of this state. This is similar to Hidden Markov Models (HMM). Both probabilities are position dependent parameters and depend on the state. To check 24

33 3.2. Covariance Models for validity of a secondary structure, a parse tree is built. The nodes in the parse tree model states such as pair, single strand left or right, deletion, bifurcation, start and end. There exist an exponential number of parse trees for one nucleotide sequence and this problem is solved in polynomial time via dynamic programming [72, 73]. A covariance model is built based on a multiple sequence alignment with consensus secondary structure annotation. Using several additional rules, a unique parse tree for the consensus secondary structure is built. This parse tree is called a guide tree. The nodes in a guide tree have eight distinct types of nodes, similar to those of the general parse tree. An RNA secondary structure and the corresponding guide tree can be seen in Fig Based on the guide tree, a covariance model is built by extending each emitting state (node) of the guide tree, thus excluding bifurcation and end state (see Fig. 3.5). As in a multiple sequence alignment not every sequence perfectly fits to the consensus, insertions and deletions are included to the model. In each state of the guide tree, several substates are modelled, such as a match state, an insertion state and a deletion state. Transitions with transition probabilities as well as emissions and emission probabilities are added, too [72, 73]. "split set" inserts "split set" inserts "split set" insert MP 12 ML 13 MR 14 D 15 IL 16 IR 17 MP 18 ML 19 MR 20 D 21 IL 22 IR 23 MR 24 D 25 MATP 6 MATP 7 MATR 8 Figure 3.5.: Part of the inner structure of a CM (taken from [74]). MATP6, MATP7 and MATR8 are three nodes of the guide tree indicating two matched pairs and one single sequence on the right side. Inside each node are several states e.g. a match state, a deletion state and states for single strand on each side, where split set denotes that exactly one of these states has to be taken. If an insertion is possible an insertion state is linked in between whereas the insertion can happen on the left or the right side. Links between states are labeled with transition probabilities. In this way, the whole guide tree is extended to become a covariance model. IR S 1 IL 2 IR 3 ML 4 D 5 IL 6 ML 7 D 8 IL 9 B 10 S 11 MP 12 ML 13 MR 14 D 15 IL 16 IR 17 MP 18 ML 19 MR 20 D 21 IL 22 IR 23 MR 24 D 25 IR 26 MP 27 ML 28 MR 29 D 30 IL 31 IR 32 ML 33 D 34 IL 35 ML 36 D 37 IL 38 ML 39 D 40 IL 41 ML 42 D 43 IL 44 E 45 S 46 IL 47 ML 48 D 49 IL 50 MP 51 ML 52 MR 53 D 54 IL 55 IR 56 MP 57 ML 58 MR 59 D 60 IL 61 IR 62 ML 63 D 64 IL 65 MP 66 ML 67 MR 68 D 69 IL 70 IR 71 ML 72 D 73 IL 74 ML 75 D 76 ROOT 1 MATL 2 MATL 3 BIF 4 BEGL 5 MATP 6 MATP 7 MATR 8 MATP 9 MATL 10 MATL 11 MATL 12 MATL 13 END 14 BEGR 15 MATL 16 MATP 17 MATP 18 MATL 19 MATP 20 MATL 21 MATL 22

3. Computational Methods To build a covariance model out of a multiple sequence alignment, the program cmbuild is used which is part of the Infernal suite.

This is important to be able to output e-values when using Infernal on a test set [71]. The original multiple sequence alignments, can be extended using cmalign.

34 3. Computational Methods To build a covariance model out of a multiple sequence alignment, the program cmbuild is used which is part of the Infernal suite. A CM is calibrated using cmcalibrate which applies the CM to sampled sequences and calculates probabilities for finding a certain type of secondary structure. This is important to be able to output e-values when using Infernal on a test set [71]. The original multiple sequence alignments, can be extended using cmalign. The program tries to align a given sequence based on the consensus secondary structure to the multiple sequence alignment which was used to build the CM. To scan whole genomes cmsearch is used, while for sequence databases cmscan is applied [71] Pattern Search For BHB elements, the conservation of the secondary structure motif is known, but by now, no sequence conservation patterns or other small motifs were found. In addition to circularized sequences with BHB element, also sequences were found which do not show evidence for a BHB motif even though they occur next to a circularization junction [1]. To find a putative splicing recognition element, MEME [75] and MEME-SP [76] were used to search for distinct small secondary structure motifs or sequence patterns. The program MEME searches for conserved sequence motifs. The search is based on position-dependent probabilities for each letter. The probabilities are stored in a matrix. The patterns are gap free, therefore, if gaps exist in the input sequence, MEME splits the sequences in several parts and thus, as well the motifs. As an output MEME shows the motifs which were found as well as the number of sequences, mismatches and locations [75]. Fig. 3.6 shows a typical box C/D snorna motif found with MEME. Figure 3.6.: Typical sequence motif for a box C/D snorna with C box motif TGATGA and D box motif CNGA [75]. Box C and box D motifs can be very similar to C and D box motifs but are less conserved and thus, show more variance in the sequence motifs. To search for secondary structure motifs, MEME-SP [76] was used. MEME-SP uses the same expectation maximization framework as MEME, but learns from sequences that, 26

35 3.3. Pattern Search in addition, are annotated with secondary structure profiles. These specify, for each position of each input sequence, the probability that the base is contained in a stem, an interior loop, or a bulge. This secondary structure information is computed for all locally stable suboptimal structures that can be formed, as hybridization structures form the sequence surrounding the circularization sites. The suboptimal hybridization structures are computed by RNAduplex [77] with parameter -e 5 limiting the energy band of interest to the 5 kcal/mol above the most stable structure. 27

37 4. Workflow BHB elements were searched by looking at secondary structure motifs around circularized introns as well as identifying new srnas due to the detection of splicing junctions flanked with BHB elements. In this chapter, the steps done during the analysis are described. The workflow is sketched in Fig. 4.1, including data sets, methods and the evaluation step. The computational methods are described in more detail in Chap. 3. The results recovered during the evaluation can be found in the following chapter, Chap. 5. This chapter is an extended version of the methodological section in Berkemer et al. [2]. Figure 4.1.: Workflow of the genome-wide survey for BHB elements and circular or spliced RNA. From a cache of known circular RNA sequences as well as known box C/D snornas two curated multiple-sequence alignments (MSA1 and MSA2) were created that serve as the basis for a stochastic Infernal covariance model. These are evaluated against RNA-seq data and putative BHB elements found in several of archaeal genomes. Known circularized RNA without BHB elements were further subjected to sequence and structure motif discovery using MEME and MEME-SP. 29

38 4. Workflow 4.1. Known and Putative BHB Elements BHB elements have been described extensively for trnas [36] and rrnas. The alignments were extended with BHB elements from ref. [36] and with trna associated BHB elements from ref. [34]. Most archaeal trnas, including the spliced ones, but with the notable exception of split and permuted trnas are readily identified by trnascan-se [78]. Therefore trnascan-se -A was used for all genomes to complete trna data. A multiple sequence alignment, MSA1, was manually built from sequences with known BHB elements. It emphasizes the well-conserved secondary structure pattern, Figure 4.2(top). To investigate the box C/D snornas a second alignment was constructed, MSA2, from 15 Nanoarchaeum equitans snorna sequences with unambiguous box C and box D motifs annotated in ref. [68]; the construction of the alignment was done in a semi-automatic fashion. Since these short RNAs (length nt) exist as circularized molecules, putative BHB elements similar to those of trna introns should be detectable overlapping both circularizing junctions if they are indeed processed in the same manner as trnas. Therefore 15 nt flanking sequence was included, Figure 4.2(bottom). In addition the known box C/D snorna sequences of Sulfolobus solfataricus [16, 79], Sulfolobus acidocaldarius [80, 16], Nanoarchaeum equitans [68], and Methanopyrus kandleri [34] were retrieved together with 15 nt flanking sequence. For the construction of MSA2 LocARNA [81] was used. This tool simultaneously infers an alignment and a consensus secondary structure from a set of unaligned RNA sequences taking into account sequence similarity, structure similarity, and thermodynamics [82]. To include further knowledge, in the case C and D boxes as well as the kink turn motifs of the snornas, a constrained LocARNA alignment was generated anchored at annotated columns of an initial manual alignment Circular transcripts Known circular srnas were compiled from Sulfolobus solfataricus [16, 79], Sulfolobus acidocaldarius [80, 16], Nanoarchaeum equitans [68], and Methanopyrus kandleri [34]. Publicly available RNA-seq data of S. acidocaldarius [83, 84], Nanoarchaeum equitans and Ignicoccus hospitalis [68], Sulfolobus solfataricus [79] and M. kandleri [34] were mapped and analyzed as outlined in [18]. In brief, read data sets for each species were pooled, the reads were quality-trimmed and then mapped to the corresponding genome using segemehl [4, 5] with the option --splits that forces reads that do not acceptably match as a single substring to be split and matched across splice sites. It was required that a seed of at least 11 nt maps to each side of the split. Aiming to find putative splicing sites, a remapping was done using lack [70], another program provided by the segemehl suite. The statistics of the mappings are summarized in Table 4.1. The transrealign tool, a component of the segemehl suite [5], was used to identify the circularizing splice junctions. A circularization site was retained if transrealign reported a backsplicing junction and a threshold of 5 reads covering the splice junction. 30

MSA2 below comprises box C/D snornas from N. equitans [68].

39 4.2. Circular transcripts Figure 4.2.: Consensus multiple alignment of BHB structural elements. MSA1 (top) is built from trnas with introns from [34] (block A) and [36] (block B). MSA2 below comprises box C/D snornas from N. equitans [68]. The two cleavage sites in the bulges (labeled B1 and B2 and marked by BBB at the bottom) are indicated by arrows. Base pairs are denoted by [,]. The three helical regions are highlighted and labelled H, h, and g, resp. 31

something about srna in archaea

something about srna in archaea or: Processed Small RNAs in Archaea and BHB Elements Sarah Berkemer Bioinformatics Vienzig Archaea? Sarah Berkemer (Bioinformatics Vienzig) BHB elements in Archaea 2 / 23