Annotation of Drosophila grimshawi Contig12


Marshall Strother
April 27, 2009

Contents

1 Overview
2 Genes
    2.1 Genscan Feature 12.4
        2.1.1 Genome Browser: First Look
        2.1.2 BLASTP
        2.1.3 FlyBase Gene Record Finder
        2.1.4 TBLASTN
        2.1.5 Genome Browser for Intron/Exon Boundaries
        2.1.6 Gene Checker
    2.2 Genscan Feature 12.3
        2.2.1 Genome Browser: First Look
        2.2.2 BLASTP
        2.2.3 FlyBase Gene Record Finder
        2.2.4 TBLASTN
        2.2.5 Genome Browser
        2.2.6 Gene Checker
    2.3 Genscan Feature 12.2
        2.3.1 Genome Browser: First Look
        2.3.2 BLASTP
        2.3.3 BLASTX
        2.3.4 BLAT
        2.3.5 BLASTX: Predicted Exon Only
        2.3.6 Genome Browser
        2.3.7 BLASTP
        2.3.8 CLUSTALW: Predicted Exon Only
        2.3.9 Conclusion
    2.4 Genscan Feature 12.1
        2.4.1 Genome Browser: First Look
        2.4.2 BLASTP
        2.4.3 BLASTX
        2.4.4 BLAT
3 Repeats
4 CLUSTALW
    4.1 CG2177-PC
        4.1.1 First Alignment
        4.1.2 Alignment without S. cerevisiae
    4.2 CG32850-PA
5 Synteny
6 Conclusion
7 Appendix
    7.1 CG2177 Ortholog - B Isoform
        7.1.1 .fasta file
        7.1.2 .gff file
        7.1.3 .pep file
    7.2 CG2177 Ortholog - C Isoform
        7.2.1 .fasta file
        7.2.2 .gff file
        7.2.3 .pep file
    7.3 CG32850 Ortholog
        7.3.1 .fasta file
        7.3.2 .gff file
        7.3.3 .pep file

List of Figures

1 Initial genome browser view
2 Map of final annotation
3 BLASTP of Genscan feature 12.4 against D. melanogaster annotated protein database
4 CG2177 isoforms in melanogaster
5 The exons and isoforms of CG2177 mapped onto the D. grimshawi contig. As discussed in Section 2.1.4, the first 12 amino acids of exon 5_941 (represented here by a red bar) do not align between D. grimshawi and D. melanogaster. Coupled with data from the Genome Browser (Section 2.1.5), this suggests that the A isoform has no ortholog in D. grimshawi.
6 The beginning of the annotated CG2177 ortholog. The start codon in frame 3 is the true start to the gene (orthologous to exon 4_942). The two stop codons in the same frame and the lack of a start codon rule out the presence of an exon orthologous to 5_941.
7 BLASTP results for Genscan feature 12.3
8 Possible splice acceptors for exon orthologous to 3_959. The boundary that was ultimately annotated is represented by the second red box from the left.
9 Text results of BLASTX search of the section of contig12 relevant to Genscan feature 12.2 against the NCBI NR database
10 BLAT results of region of contig12 relevant to Genscan feature 12.2
11 Example of BLAT browser view for one of the results shown in Figure 10
12 Results of the BLASTX search for the exonic sequence of Genscan feature 12.2
13 Genome browser view showing the region around Genscan feature 12.2
14 BLASTP against the NCBI NR database of the amino acid sequence of the predicted exon near Genscan feature 12.2 in frame -2
15 CLUSTALW alignment of proposed protein sequence related to Genscan feature 12.2 and protein sequences from the top five BLASTX hits (see Sections 2.3.8 and 2.3.5)
16 BLAT results of region of contig12 relevant to Genscan feature 12.1
17 Example browser views for one of the results shown in Figure 16
18 Summary and detailed RepeatMasker output
19 CLUSTALW alignment of the grimshawi CG2177-PC protein ortholog sequence with the putative orthologous sequences from D. melanogaster, H. sapiens, C. elegans, and S. cerevisiae
20 Repeat of the alignment shown in Figure 19 with the S. cerevisiae sequence removed
21 CLUSTAL alignment for the CG32850-PA protein using the orthologs from mojavensis, virilis, pseudoobscura, melanogaster, and grimshawi
22 Annotated melanogaster genes between bases 306,470 and 340,172 of the dot chromosome
23 Summary table of RepeatMasker output from run on D. melanogaster

List of Tables

1 Summary of annotations. Feature regions extend from the predicted promoter to the predicted polyadenylation site. Figure 2 shows only the exons of the predicted features.

2 TBLASTN results for CG2177
3 Gene model corresponding to Genscan feature 12.4. Note that 5_941 has no ortholog in grimshawi.
4 TBLASTN results for CG32850
5 Model for CG32850 ortholog
6 Synteny comparisons of melanogaster and grimshawi chunks. Lengths of genes are given from start codon to stop codon. Distances between genes are given from start codon to start codon.
7 Comparison of repeats in melanogaster and grimshawi

Figure 1: Initial genome browser view

Feature        Region             Putative Annotation
Genscan 12.1   12,593 to 15,160   Unmasked repeat
Genscan 12.2   25,554 to 15,455   Unknown coding gene
Genscan 12.3   29,314 to 25,609   CG32850 ortholog
Genscan 12.4   33,312 to 29,479   CG2177 ortholog

Table 1: Summary of annotations. Feature regions extend from the predicted promoter to the predicted polyadenylation site. Figure 2 shows only the exons of the predicted features.

1 Overview

In this paper, I present the annotation of the newly finished contig12 of the Drosophila grimshawi dot chromosome. This annotation was done chiefly by using comparative genomic methods to confirm or refute predictions made by the de novo gene predictor Genscan. Synteny comparisons with D. melanogaster were also taken into account and are discussed. Genscan's predictions generally coincided with those of other de novo gene predictors (see Figure 1).

The final annotation (shown in Figure 2) includes two genes, CG32850 and CG2177, which correspond to Genscan features 12.3 and 12.4 and can be found in the contig at bases 29,314-25,609 and 33,312-29,479 respectively. Multiple sequence alignment of these genes to orthologous genes in other species was performed using CLUSTALW to gain insight into their evolutionary history. The other two Genscan-predicted features were annotated as repetitious sequence left unmasked by preprocessing with RepeatMasker. These features contribute to the overall high repeat density of the contig (60.93% not including the features).

2 Genes

2.1 Genscan Feature 12.4

My first significant annotation began with an examination of Genscan feature 12.4. The procedure described below is quite typical of a straightforward annotation of a highly probable ortholog.

Figure 2: Map of final annotation

Figure 3: BLASTP of Genscan feature 12.4 against D. melanogaster annotated protein database

2.1.1 Genome Browser: First Look

I began by looking at the genome browser to get a general impression of the feature and what to expect. As shown in Figure 1, the initial genome browser view, feature 12.4 aligns quite strongly with predictions from all of the other gene predictors as well as the BLASTX alignment to D. melanogaster proteins and several other sources of Drosophila genomic data.

2.1.2 BLASTP

I then obtained the sequence of the predicted Genscan feature from the provided output and used it as the query in a BLASTP search against the D. melanogaster annotated protein database. The best-matching gene from D. melanogaster would be examined in subsequent steps for homology. The results for feature 12.4 (Figure 3) showed strong similarity to the D. melanogaster CG2177 protein.

2.1.3 FlyBase Gene Record Finder

I began confirming homology by looking up the gene in FlyBase and obtaining the sequences of all of the gene's exons from the CDS Translations section. In the case of CG2177, there are several isoforms, all of which are combinations of the exons CDS_FBgn0039902:1_941, CDS_FBgn0039902:2_941, CDS_FBgn0039902:4_942, CDS_FBgn0039902:5_941, and CDS_FBgn0039902:2_942. From here on, these exons will be referred to with everything before the colon truncated for simplicity. All three of these isoforms (CG2177-RA (see note 1), CG2177-RB, and CG2177-RC) are confirmed in melanogaster according to the record for this gene in FlyBase.

Figure 4: CG2177 isoforms in melanogaster

The Gene Record Finder also contains links to the FlyBase page discussing the gene's function. The CG2177 protein has activity as a metal-ion transmembrane transporter, which is not confirmed in vivo but is predicted from InterPro electronic annotation. It has only one known allele.

2.1.4 TBLASTN

I then used TBLASTN to systematically align each exon with the entire masked contig. Each search typically returned exactly one plausible alignment. I noted the ends of the alignment on the contig as putative exon boundaries, as well as the frame of the alignment (see Table 2). If a significant section of the query was missing on either end of the alignment (e.g., the first or last 10 amino acids), this was also noted.

Exon    Length (amino acids)   Aligned Region in Exon   Aligned Region in Contig   Frame
1_941   168                    1 to 168                 30,634-31,185              1
2_941   95                     1 to 95                  30,249-30,563              3
4_942   33                     1 to 33                  30,081-30,179              3
5_941   45                     13 to 45                 30,081-30,179              3
2_942   32                     1 to 32                  30,468-30,563              3

Table 2: TBLASTN results for CG2177

The entire length of the query (exon) aligned to the subject (nucleotides) for all exons except 5_941. In 5_941, the first 12 of the 45 amino acids do not align. Interestingly, exons 5_941 and 4_942 are exactly the same except for the first 12 amino acids of exon 5_941. Increasing the expect value threshold by two orders of magnitude fails to give any alignments that include the first 12 amino acids of exon 5_941 in the same frame. This makes it very likely that exon 5_941 in D. melanogaster does not have an ortholog in D. grimshawi. This conclusion would be all but confirmed if the genome browser shows no start codon approximately 12 codons upstream of the putative ortholog to 4_942.
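The frame column of Table 2 can be cross-checked from the alignment coordinates alone. The following is a minimal sketch, assuming 1-based contig coordinates and the common convention that forward frame 1 begins at base 1; the function name `forward_frame` is illustrative and is not part of TBLASTN or any GEP tool.

```python
def forward_frame(first_base: int) -> int:
    """Return the forward reading frame (1, 2, or 3) of a codon whose
    first base sits at the given 1-based contig coordinate.

    Assumption: frame 1 begins at base 1, frame 2 at base 2, frame 3 at base 3.
    """
    if first_base < 1:
        raise ValueError("coordinates are 1-based")
    return (first_base - 1) % 3 + 1


# Start coordinates of the aligned regions listed in Table 2.
table2_starts = {"1_941": 30634, "2_941": 30249, "4_942": 30081, "2_942": 30468}
frames = {exon: forward_frame(start) for exon, start in table2_starts.items()}
print(frames)
```

Under that assumption the computed frames (1, 3, 3, 3) agree with the frames reported in Table 2. Reverse-strand frames are deliberately left out, because they depend on the contig length and on the numbering convention of the tool in use.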
2.1.5 Genome Browser for Intron/Exon Boundaries

Using the putative exons identified in the TBLASTN search as a guide, I returned to the genome browser to locate the exact sites of the gene's start codon, stop codon, and intron/exon boundaries. At the same time, I could also observe how well this gene model corresponded to the original gene model predicted by Genscan. An example of the views used to identify exon boundaries of the CG2177 gene is shown in Figure 6. Note that this includes the beginning of exon 5_941, which required some special consideration because the first 12 amino acids did not align with D. melanogaster. There are no other possible start codons anywhere within a reasonable distance of the alignment, and there are several in-frame stop codons. Loss of an exon between species is quite rare; however, the absence of a start codon together with the presence of in-frame stop codons is very strong evidence against the presence of exon 5_941 in D. grimshawi, so I will not include an ortholog to 5_941 in my final annotation.

The final putative gene model for CG2177 in D. grimshawi is shown in Table 3. This corresponds very well with the Genscan prediction as well as the TBLASTN alignments.

Note 1: RA, RB, RC, etc. refer to the processed mRNA transcripts of each of the isoforms of a gene, whereas PA, PB, and PC refer to the resulting translated proteins. When the distinction is relevant, an effort has been made to use the appropriate label; however, R vs. P labeling is often used interchangeably.
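The start- and stop-codon evidence used in Section 2.1.5 was read off the genome browser, but the same kind of scan can be sketched in a few lines. The example below is only an illustration on a made-up toy sequence (not the contig), handles forward frames only, and uses hypothetical function names.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons_in_frame(seq, frame):
    """Yield (1-based position of first base, codon) for a forward frame 1-3."""
    seq = seq.upper()
    for i in range(frame - 1, len(seq) - 2, 3):
        yield i + 1, seq[i:i + 3]


def starts_and_stops(seq, frame):
    """Return two lists: positions of ATG codons and of stop codons in a frame."""
    starts, stops = [], []
    for pos, codon in codons_in_frame(seq, frame):
        if codon == "ATG":
            starts.append(pos)
        elif codon in STOP_CODONS:
            stops.append(pos)
    return starts, stops


# Toy sequence: two in-frame stops precede the only ATG in frame 3.
toy = "GGTAATGAATGGCTTTCTAG"
starts, stops = starts_and_stops(toy, 3)
print("ATG at:", starts, "stops at:", stops)
```

In the toy sequence, the stop codons upstream of the only in-frame ATG play the same role as the two stop codons in Figure 6: an exon cannot extend upstream through them in that frame.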

Figure 5: The exons and isoforms of CG2177 mapped onto the D. grimshawi contig. As discussed in Section 2.1.4, the first 12 amino acids of exon 5_941 (represented here by a red bar) do not align between D. grimshawi and D. melanogaster. Coupled with data from the Genome Browser (Section 2.1.5), this suggests that the A isoform has no ortholog in D. grimshawi.

Figure 6: The beginning of the annotated CG2177 ortholog. The start codon in frame 3 is the true start to the gene (orthologous to exon 4_942). The two stop codons in the same frame and the lack of a start codon rule out the presence of an exon orthologous to 5_941.

Dmel CG2177 Ortholog   Frame   First bp after splice acceptor   Start Codon     Last bp before splice donor   Stop Codon
1_941                  1       30,632                           X               X                             31,186-31,188
2_941                  3       30,249                           X               30,564                        X
4_942                  3       X                                30,081-30,083   30,179                        X
5_941                  3       X                                X               X                             X
2_942                  3       X                                30,468-30,563   30,564                        X

Table 3: Gene model corresponding to Genscan feature 12.4. Note that 5_941 has no ortholog in grimshawi.
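A gene model such as the one in Table 3 can be sanity-checked against a few basic rules before it is submitted to a dedicated checker. The sketch below is not the GEP Gene Model Checker and makes no claim about how that tool works; it assumes a forward-strand model, 1-based inclusive exon coordinates, and canonical GT...AG introns, and it runs on a made-up toy contig.

```python
STOPS = {"TAA", "TAG", "TGA"}


def check_gene_model(contig, exons):
    """Return a list of problems found in a forward-strand gene model.

    exons: list of (start, end) pairs, 1-based inclusive, in transcript order.
    """
    problems = []
    cds = "".join(contig[s - 1:e] for s, e in exons).upper()
    if len(cds) % 3 != 0:
        problems.append("CDS length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("CDS does not begin with a start codon")
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    if not codons or codons[-1] not in STOPS:
        problems.append("CDS does not end with a stop codon")
    if any(c in STOPS for c in codons[:-1]):
        problems.append("in-frame stop codon inside the CDS")
    # Each intron is the sequence between consecutive exons.
    for (_, end1), (start2, _) in zip(exons, exons[1:]):
        intron = contig[end1:start2 - 1].upper()
        if not (intron.startswith("GT") and intron.endswith("AG")):
            problems.append(f"intron {end1 + 1}-{start2 - 1} lacks GT...AG")
    return problems


# Toy contig: 2 bp flank, exon 1 (3-8), intron (9-20), exon 2 (21-26), 2 bp flank.
toy_contig = "CC" + "ATGGCT" + "GTAAGTTTTCAG" + "GAATAA" + "CC"
print(check_gene_model(toy_contig, [(3, 8), (21, 26)]))   # consistent model
print(check_gene_model(toy_contig, [(3, 8), (22, 26)]))   # shifted acceptor
```

Shifting the second exon's first base by one position breaks both the reading frame and the acceptor site, which is the kind of boundary error the exon-by-exon inspection in Section 2.1.5 is meant to catch.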

Figure 7: BLASTP results for Genscan feature 12.3

2.1.6 Gene Checker

Finally, having come up with boundaries for each exon, I constructed gene models for each isoform of the protein and ran each through the gene checker (one check for each isoform), which checks that the gene model obeys basic biological rules for plausible genes. Since exon 5_941 is predicted to be missing in D. grimshawi and is necessary for the A isoform, I only ran the gene model checker for the isoforms predicted to be orthologous to the B and C isoforms. Both models passed.

2.2 Genscan Feature 12.3

2.2.1 Genome Browser: First Look

As shown in Figure 1 (the initial genome browser view), feature 12.3, like feature 12.4, aligns quite strongly with predictions from all of the other gene predictors as well as the BLASTX alignment to D. melanogaster proteins and several other sources of Drosophila genomic data.

2.2.2 BLASTP

Running the same BLASTP search as described in Section 2.1.2 gives the results shown in Figure 7. The best match is to the D. melanogaster CG32850 gene (PA isoform), which leads the next-best hit by over 100 points in score and over 46 orders of magnitude in E-value. Matches to other genes also have E-values much less than 10^-4, so these warrant investigation, but they all align in a region between amino acids 90 and 130 of the query. These matches may therefore indicate a conserved protein functional motif that is present in many different genes rather than a gene-gene orthology relationship.

Looking up the function of the CG32850 gene in Entrez reveals that it has both protein-binding and zinc-ion-binding functions. No part of the gene is annotated as a functional motif, but a cursory examination of the Entrez gene records for several of the weaker matches in the BLASTP search shows that all genes sampled also have protein-binding and zinc-ion-binding functions, which supports the functional motif hypothesis.
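The reasoning in Section 2.2.2 separates a hit that spans the query from hits confined to one short window of it. A minimal sketch of that distinction follows; the hit names and coordinates are invented placeholders for illustration only, not values from Figure 7, and the helper name is hypothetical.

```python
def hits_within_window(hits, lo, hi):
    """Return the names of hits whose aligned query span lies entirely
    inside the window [lo, hi] (amino acid positions, inclusive)."""
    return [name for name, q_start, q_end in hits if q_start >= lo and q_end <= hi]


# Placeholder hits as (name, query start, query end); NOT the real BLASTP output.
example_hits = [
    ("best_hit", 1, 148),
    ("weak_hit_1", 92, 128),
    ("weak_hit_2", 95, 130),
]
local_only = hits_within_window(example_hits, 90, 130)
print(local_only)
```

Hits that fall entirely inside the same short window are the ones more plausibly explained by a shared motif than by orthology of the whole gene.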

Exon    Length (amino acids)   Aligned Region in Exon   Aligned Region in Contig   Frame
2_959   31                     1 to 31                  29,102 to 28,998           -2
3_959   117                    9 to 117                 28,924 to 28,595           -3

Table 4: TBLASTN results for CG32850

Figure 8: Possible splice acceptors for exon orthologous to 3_959. The boundary that was ultimately annotated is represented by the second red box from the left.

2.2.3 FlyBase Gene Record Finder

According to the Gene Record Finder, CG32850 has two coding exons and a single isoform. The exons are called CDS_CG32850:2_959 and CDS_CG32850:3_959 (hereafter referred to as exons 2_959 and 3_959 respectively). They appear in the fully translated gene in order of their numbering. All exons are on the + strand of melanogaster chromosome 4. There is no additional information in FlyBase about the function of the gene, its translated protein, or its number of observed alleles.

2.2.4 TBLASTN

The results of TBLASTN searches similar to those performed in Section 2.1.4 are shown in Table 4.

2.2.5 Genome Browser

Exon 2_959: A putative exon orthologous to exon 2_959 was found in the genome browser that perfectly matched the prediction from the TBLASTN alignment. In frame -2 there is a start codon at 29,102-29,100 and a GT intron donor site at 28,997-28,996 (phase 0).

Exon 3_959: There is a stop codon in frame -3 at 28,597-28,595, which matches the TBLASTN prediction for exon 3_959 perfectly. Identification of a putative splice acceptor site was slightly more difficult. There were five possible sites that were all in phase 0 relative to frame -3. These sites corresponded to the first base of the exon being at 28,927, 28,939, 28,942, 28,945, and 28,966 (see Figure 8). The first of these puts the beginning of the exon closest to the beginning of the alignment to melanogaster exon 3_959. However, the alignment did not include the first 8 amino acids of the query, which led me to believe that the splice site was actually at a higher base number. There is a stop codon in frame -3 beginning at base 28,948, so the splice site could not be at any base number higher than that. These two lines of evidence led me to choose the splice site composed of bases 28,947 and 28,946. Further evidence for this splice site is supplied by the genome browser (which labels it as a high-likelihood acceptor) and by the fact that this splice site is only 3 bases downstream from where we would expect to find it if all 8 amino acids missing from the alignment were included in the exon.

Dmel CG32850 Ortholog   Frame   First bp after splice acceptor   Start Codon     Last bp before splice donor   Stop Codon
2_959                   -2      X                                29,102-29,100   28,998                        X
3_959                   -3      28,945                           X               X                             28,597-28,595

Table 5: Model for CG32850 ortholog.

2.2.6 Gene Checker

The final model for the grimshawi putative ortholog to CG32850 is shown in Table 5. It passes the gene checker.

2.3 Genscan Feature 12.2

2.3.1 Genome Browser: First Look

Genscan feature 12.2 is a predicted single-exon gene between two long stretches of repetitious sequence. Single-exon genes are not common, so this piece of evidence alone suggests that it may be a misprediction by Genscan. Its location in the middle of repetitious sequence may suggest that it is actually repetitious sequence missed by RepeatMasker. Finally, Genscan is the only gene predictor that predicted a gene in this region, which is evidence in support of this feature being a miscall by Genscan.

2.3.2 BLASTP

A BLASTP search, with an E-value threshold of 1, of the Genscan-predicted amino acid sequence for feature 12.2 reveals no significant matches to D. melanogaster or any of the other species available in FlyBase. (This was done in two searches: one against only D. melanogaster, the other against all species.)
A search against all species with the low-complexity filter turned off returns the same results.

2.3.3 BLASTX

To collect further evidence that this is not a significant feature, I extracted the entire region of the contig from the beginning to the end of Genscan feature 12.2 (bases 15,455 to 25,554) using the EMBOSS tool extractseq on the GEP server. I then ran a BLASTX search against the NCBI NR database using this extracted region as the query. This search returns a large number of extremely significant alignments (E-values ranging from 1e-162 to 1e-30 in the first 30 hits) to reverse transcriptase genes in many different Drosophila species (see Figure 9). It also has matches of similar strength to predicted RefSeq proteins. Nevertheless, given the reverse transcriptase hits and this feature's proximity to large repetitious elements as seen in the genome browser (see Figure 1), it is highly likely that this feature is the result of repetitious sequence that RepeatMasker failed to mask.

2.3.4 BLAT

To further confirm that this feature is due to repetitious sequence rather than a true gene, I did a BLAT search of the extracted region against the entire grimshawi genome on the UCSC server. The results of this search are shown in Figure 10. As we would expect for repetitious sequence, the region aligns to many different places in the genome with a high level of similarity along the entire length of the query. Theoretically, this could still represent a large gene family with many different members throughout the genome. To rule out this possibility, I opened the genome browser view for a sampling of the top alignments in the BLAT search. An example of one of these views is shown in Figure 11. There were no alignments to any melanogaster proteins in any of these views, which provides further evidence that this feature is repetitious, non-protein-coding sequence.
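The region extraction described in Section 2.3.3 was done with EMBOSS extractseq. For readers without EMBOSS at hand, the operation itself can be sketched as follows; this is not extractseq's implementation, only an illustration that mimics its 1-based inclusive coordinates, with hypothetical function names and a toy sequence in place of the contig.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def extract_region(contig, start, end):
    """Return bases start..end of the contig (1-based, inclusive)."""
    if not 1 <= start <= end <= len(contig):
        raise ValueError("region must satisfy 1 <= start <= end <= contig length")
    return contig[start - 1:end]


def reverse_complement(seq):
    """Return the reverse complement, as needed for minus-strand features."""
    return seq.translate(_COMPLEMENT)[::-1]


toy_contig = "AACCGGTTAC"
region = extract_region(toy_contig, 1, 4)
print(region, reverse_complement(region))
```

For a minus-strand feature, the extracted region would be reverse-complemented before translation, or equivalently searched with a tool such as BLASTX that translates all six frames.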

Figure 9: Text results of BLASTX search of the section of contig12 relevant to Genscan feature 12.2 against the NCBI NR database

Figure 10: BLAT results of region of contig12 relevant to Genscan feature 12.2

Figure 11: Example of BLAT browser view for one of the results shown in Figure 10

Figure 12: Results of the BLASTX search for the exonic sequence of Genscan feature 12.2.

2.3.5 BLASTX: Predicted Exon Only

Genscan predicts that this feature has only one exon, which occurs between bases 25,321 and 25,073 inclusive. The polyadenylation signal for this gene, however, begins at base 15,455. There is a significant amount of masked repetitious sequence between base 25,073 and base 15,455. Since I included every base between the beginning of the promoter and the end of the poly(A) signal in the extracted region that I used for the above BLASTX and BLAT searches, the results of these searches could be significantly confounded by the repetitious sequence. I therefore repeated the BLASTX search using only the predicted exonic region (base 25,321 to base 25,073).

The results of the BLASTX search are shown in Figure 12. There are four fairly strong alignments (E-values on the order of 10^-20) to predicted proteins that extend between base 60 and base 235 of the query, and three more alignments to predicted proteins in around the same region that are slightly less strong (E-values on the order of 10^-13). Looking at the top six alignments from the BLASTX search, it is clear that there is a stretch of 29 amino acids (starting around translated base 149 and ending around base 62) that is particularly well conserved and is almost identical in the query and all of the predicted proteins. This conservation may be indicative of function, which would suggest that this sequence is actually transcribed and translated. (This is further discussed in Section 2.3.8.) It is interesting to note, however, that all of the alignments occur in frame -2 of the contig (which corresponds to frame -3 of the BLASTX query), while the exon of Genscan feature 12.2 is predicted to occur in frame -3 of the contig.

Figure 13: Genome browser view showing the region around Genscan feature 12.2 Figure 14: BLASTP against the NCBI NR database of the amino acid sequence of the predicted near Genscan feature 12-2 in frame -2. 2.3.6 Genome Browser A view of the Genome Browser around Genscan feature 12-2 is shown in Figure 13. There are start codons very close together in both frame -2 and frame -3 (the frames of the BLASTX alignments and the Genscan prediction respectively). The rst stop codon after the start codon in frame -2 can also be seen and is reasonably close the to stop codon associated with the Genscan prediction. From the BLASTP and BLASTX searches, it is unlikely that the exon predicted by Genscan is actually translated. However, it is possible that the there is a real exon in frame -2. Under this hypothesis, the presence of start and stop codons in frame -2 is consistent with the results of the BLASTX alignment discussed in Section 12. 2.3.7 BLASTP To conrm this new hypothesis, that the real translated exon extends from base 25,328 to base 25,135 of the contig in frame -2, I used the EMBOSS toolset to extract the region and translate it in the appropriate frame. I then used the translated sequence as the query in a BLASTP search against the NCBI NR database. The results, shown in Figure 14, look very similar to the results of the exon-only BLASTX alignments. 2.3.8 CLUSTALW: Predicted Exon Only As discussed in Section 2.3.5, one region in particular was similar between the query sequence and all of the aligned proteins in the BLASTX search described in the same section. To get a more precise illustration of this similarity, I obtained the protein sequences from the top 5 hits and aligned them using CLUSTALW with the predicted protein sequence of Genscan feature 12-2. (I excluded the sixth best BLASTX hit because it came from the same species as the fth and represented an extremely similar predicted protein.) 
The results of the alignment, shown in Figure 15, confirm that one end of the protein is significantly more conserved than the other.

2.3.9 Conclusion

The evidence of differential cross-species conservation revealed by the CLUSTALW and BLASTX alignments discussed above is enough for me to conclude that the region between base 25,328 and base 25,135 is translated in frame -2. I now believe that the hypothesis that this feature is actually unmasked repetitious sequence is incorrect. The evidence that supports such a hypothesis can be largely explained by a dearth of confirmed similar proteins and the inclusion of a large amount of repetitious sequence in my earlier searches.
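The extraction and reverse-frame translation step described in Section 2.3.7 can be illustrated with a minimal Python sketch. This is not the EMBOSS workflow used in the report; the function names are hypothetical, and the frame convention (frame -k starts at offset k-1 of the reverse complement of the extracted region) is an assumption that may differ from the numbering used by the genome browser or by BLASTX. The toy sequence is invented for illustration.

```python
# Minimal sketch: extract a region by 1-based inclusive coordinates and
# translate it in the three reverse frames using the standard genetic code.
# Assumption: frame -k begins at offset k-1 of the reverse complement.

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def extract(contig, low, high):
    """Return bases low..high (1-based, inclusive) of the contig."""
    return contig[low - 1:high]


def reverse_complement(seq):
    return seq.upper().translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons become 'X'."""
    usable = len(seq) - len(seq) % 3
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, usable, 3))


def translate_reverse_frames(region):
    """Map frame label (-1, -2, -3) to the translated peptide."""
    rc = reverse_complement(region)
    return {-(offset + 1): translate(rc[offset:]) for offset in range(3)}


# Toy example: the forward strand holds the reverse complement of ATGGCCTAA.
toy_contig = "GGTTAGGCCATGG"
region = extract(toy_contig, 3, 11)
frames = translate_reverse_frames(region)
```

A frame that contains the conserved peptide without internal stop codons ('*') would be the candidate to submit to BLASTP.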

Figure 15: CLUSTALW alignment of the proposed protein sequence related to Genscan feature 12-2 and the protein sequences from the top five BLASTX hits (see Sections 2.3.5 and 2.3.8).

Since there were no alignments to known proteins or domains in any of the searches, I am unable to make any prediction about this protein's function at this time. Since I could not identify an ortholog in D. melanogaster, I am unable to run the predicted gene through the Gene Model Checker.

2.4 Genscan Feature 12.1

2.4.1 Genome Browser: First Look

Genscan feature 12.1 is in the same region as predictions from several other gene predictors. It also aligns with a stretch of the X chromosome from D. melanogaster, so if it does turn out to be a true orthologous gene, it may represent an event where a gene has been translocated between two chromosomes. Finally, like feature 12.2, it is found in the middle of repetitious elements, so it may turn out to be another unmasked repetitious sequence.

2.4.2 BLASTP

The same BLASTP searches as performed for feature 12.2 (Section 2.3.2) returned no significant hits.

2.4.3 BLASTX

I then performed the same BLASTX searches as I did for feature 12.2 (Section 2.3.3) using the extracted region built from the Genscan prediction (bases 12,593 to 15,160). This time there were many very strong hits to predicted genes, the vast majority of which cluster in the first 1200 bases of the extracted sequence, but no hits to the reverse transcriptase genes seen in feature 12.2. There were no hits to any confirmed genes.

Unlike Genscan feature 12.2, the beginning of the promoter of Genscan feature 12.1 and the end of the polyA sequence form fairly tight bounds around the predicted exonic sequences of feature 12.1, which makes this kind of BLASTX search (and the following BLAT search) more informative in this case than it was in the case of Genscan feature 12.2.

2.4.4 BLAT

I then performed the same BLAT search for this feature as for feature 12.2 (see Section 2.3.4). The results (shown in Figure 16) were similar and showed a large number of hits with a high percentage of identity to the query (89-95%).
About 20% of these hits spanned almost the entire query sequence, and the remaining 80% aligned to either the first half of the query or the second half of the query. As before, these results are consistent with the hypothesis that feature 12.1 represents unmasked repetitious sequence. As before, it is still theoretically possible that this could represent an extremely large gene family, so I looked at the genome browser view for some of the matches. Some showed results that are extremely implausible for a real gene (e.g. alignments that do not correspond to gene predictions or comparative genomics tracks, or implausibly large and frequent introns), and others showed extremely strong alignments to RepeatMasked sequence (see Figure 17). Both observations further support my hypothesis that this feature represents unmasked repeat.

3 Repeats

According to RepeatMasker, contig12 is 60.93% repeat, although, as discussed in Sections 2.3 and 2.4 above, I believe that the actual percentage is significantly higher. See Figure 18 for the full table of results.

4 CLUSTALW

4.1 CG2177-PC

4.1.1 First Alignment

Since the alignment of the D. grimshawi gene to the D. melanogaster gene was so strong (E-value on the order of 10^-92, see Figure 3) and there were immediately available putative orthologs for many distantly related species, I decided to do a CLUSTALW alignment of the grimshawi CG2177-PC protein ortholog sequence with the putative orthologous sequences from D. melanogaster, H. sapiens, C. elegans, and S. cerevisiae. This alignment, shown in Figure 19, showed relatively little conservation of the amino acid sequence of this gene, though the region corresponding to grimshawi amino acids 135 to 253 showed significantly more conservation than the rest of the protein.
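The kind of regional conservation read off the CLUSTALW output in Section 4.1.1 can also be quantified. The following is a minimal sketch, assuming the alignment is available as equal-length gapped strings; the scoring rule (fraction of sequences sharing the most common residue in each column) is a simple stand-in chosen for illustration and is not the conservation measure CLUSTALW itself reports. The sequences shown are invented placeholders, not the CG2177 orthologs.

```python
# Minimal sketch: per-column identity and a sliding-window average for a
# multiple alignment given as equal-length strings with '-' for gaps.
from collections import Counter


def column_identity(aligned):
    """Fraction of sequences carrying the most common residue per column."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [c for c in column if c != "-"]
        if not residues:
            scores.append(0.0)  # gap-only column
            continue
        top_count = Counter(residues).most_common(1)[0][1]
        scores.append(top_count / len(column))
    return scores


def windowed_mean(scores, window):
    """Mean score over each run of `window` consecutive columns."""
    return [sum(scores[i:i + window]) / window
            for i in range(len(scores) - window + 1)]


# Hypothetical toy alignment (three sequences, five columns).
toy_alignment = ["MKVLA", "MKILA", "MR-LA"]
scores = column_identity(toy_alignment)
smoothed = windowed_mean(scores, 2)
```

Plotting the smoothed scores against alignment position would make a conserved stretch, such as the one reported for amino acids 135 to 253, stand out from the rest of the protein.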

Figure 16: BLAT results of region of contig 12 relevant to Genscan feature 12.1

Figure 17: Example browser views for one of the results shown in Figure 16

Figure 18: Summary and detailed RepeatMasker output

Figure 19: CLUSTALW alignment of the grimshawi CG2177-PC protein ortholog sequence with the putative orthologous sequences from D. melanogaster, H. sapiens, C. elegans, and S. cerevisiae.

                                  melanogaster   grimshawi
Length of CG2177                  1,091          1,105
Length of CG32850                 3,392          505
Distance between the two genes    5,530          979

Table 6: Synteny comparisons of melanogaster and grimshawi chunks. Lengths of genes are given from start codon to stop codon. Distances between genes are given from start codon to start codon.

One explanation for such conservation is the presence of a conserved functional domain. However, a BLAST search of the protein sequence reveals that the only strong alignment to a conserved functional domain is to the ZIP superfamily, in the region of amino acids 51 to 156 in the grimshawi sequence. While the small overlap between the conserved sequence and the alignment to the ZIP superfamily domain may be significant, it is unlikely that this explains the extra conservation entirely. The conserved region likely serves some important function in the protein, but as of now it is impossible to tell exactly what.

4.1.2 Alignment without S. cerevisiae

Looking at the output from the first alignment, it is clear that there is significantly more similarity between the H. sapiens, C. elegans, D. melanogaster, and D. grimshawi proteins than there is between any one of those proteins and the S. cerevisiae protein. I therefore excluded S. cerevisiae and repeated the alignment. The results are shown in Figure 20. In this alignment, there is a region that is somewhat conserved between amino acid 10 and amino acid 60 of the D. grimshawi protein. There are also four regions of very strict conservation between amino acids 96 and 247 of the D. grimshawi protein. Again, very little conservation is seen in the region identified as resembling the ZIP superfamily domain, which leads me to believe that the similarity to ZIP is coincidental and not indicative of this protein's function.
It is likely that the protein's function is governed by the amino acids that are highly conserved between the species, but I am not able to generate any hypothesis about what that function might be through standard comparative genomics methods.

4.2 CG32850-PA

I did a CLUSTALW alignment for the CG32850-PA protein using the orthologs from mojavensis, virilis, pseudoobscura, melanogaster, and grimshawi. This alignment, shown in Figure 21, shows that the first approximately 65 amino acids of the protein are significantly less conserved than the rest of the protein. A BLAST search of the protein sequence for the grimshawi gene reveals that amino acids 95-140 align strongly to the RING superfamily of protein domains, which is a specialized zinc finger functional domain. The extra conservation in the second half of the amino acid sequence may be explained by the importance of this functional domain and the amino acids that immediately surround it.

5 Synteny

Figure 22 shows a view from FlyBase of the annotated genes on the dot chromosome of D. melanogaster between bases 306,470 and 340,172. This represents a chunk of the D. melanogaster dot chromosome equal in size to grimshawi contig 12, positioned so that the first base of the start codon of CG32850 falls at base 29,102 of the melanogaster chunk, just as the first base of the start codon of its grimshawi ortholog falls at base 29,102 of the grimshawi contig. This view shows genes CG2177 and CG32850 (which correspond to Genscan features 12-4 and 12-3 respectively) closely adjacent to each other in the same order as in the grimshawi contig, representing synteny between the two genomes. A more detailed comparison of the distance/length features in the two genomes is given in Table 6. Overall, synteny in terms of order and orientation seems to have been preserved, but there are significant variations in gene length and in the distance between the two genes.
Given the significant synteny between melanogaster and grimshawi in this region and the placement of the Rad23 and Syt7 genes in the melanogaster genome, we might expect one or both of the Rad23 and Syt7 genes to appear in the grimshawi contig. To check for such a possibility, I ran TBLASTN searches for every exon of each of these genes against the entire (unmasked) grimshawi contig. Syt7 contains 11 exons. Rad23 contains 6 exons. Of these, none showed any alignment against the grimshawi contig with an E-value less than 0.6, most had E-values greater than 1, and none of the alignments were consistent with the hypothesis that either of these genes is represented in the grimshawi contig. It is possible that these genes were either translocated out of this section of the grimshawi genome or pushed onto a different contig by a large number of repeats inserting between them and CG32850.
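The grimshawi column of Table 6 can be recomputed from the CDS coordinates listed in the appendix. The sketch below is illustrative only: the dictionary layout and function names are hypothetical, and the choices to use the C isoform of the CG2177 ortholog and to exclude the stop codon from gene length are my reading of how the table was built, made because they reproduce its values.

```python
# Minimal sketch: recompute gene spans and start-to-start distance from the
# appendix CDS coordinates (1-based, inclusive).
# Assumptions: CG2177 is represented by its C isoform; the stop codon is
# not counted in the gene length.

CG2177_C = {"strand": "+",
            "cds": [(30081, 30179), (30249, 30564), (30632, 31185)]}
CG32850 = {"strand": "-",
           "cds": [(28998, 29102), (28598, 28945)]}


def start_codon_position(gene):
    """First base of the start codon: lowest CDS base on '+', highest on '-'."""
    if gene["strand"] == "+":
        return min(low for low, _ in gene["cds"])
    return max(high for _, high in gene["cds"])


def coding_span(gene):
    """Bases from the start codon through the last coding base, introns included."""
    low = min(low for low, _ in gene["cds"])
    high = max(high for _, high in gene["cds"])
    return high - low + 1


def start_to_start_distance(gene_a, gene_b):
    return abs(start_codon_position(gene_a) - start_codon_position(gene_b))


length_cg2177 = coding_span(CG2177_C)
length_cg32850 = coding_span(CG32850)
distance = start_to_start_distance(CG2177_C, CG32850)
```

With these inputs the three values come out to 1,105, 505, and 979, matching the grimshawi entries in Table 6.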

Figure 20: Repeat of the alignment shown in Figure 19 with the S. cerevisiae sequence removed.

Figure 21: CLUSTALW alignment of the CG32850-PA protein using the orthologs from mojavensis, virilis, pseudoobscura, melanogaster, and grimshawi.

Figure 22: Annotated melanogaster genes between bases 306,470 and 340,172 of the dot chromosome.

Figure 23: Summary table of RepeatMasker output from run on D. melanogaster

As an extra check, I did the same searches using exons from genes CG42314 and Hcf, which are located just distal to CG32850 in the melanogaster genome. The results were the same as for Syt7 and Rad23.

Finally, to check for similarity in the number of repeats in the melanogaster and grimshawi chunks, I extracted the melanogaster region and ran RepeatMasker on it with the -species drosophila option. The results of this run are shown in Figure 23. Table 7 compares the amount and kinds of repeats (as well as the GC content) of the two chunks. The significantly increased number of repeats in the grimshawi contig compared to the melanogaster region may support the hypothesis that Syt7 and Rad23 are still present in the grimshawi genome, but were pushed onto another contig by elongation of repetitious regions and insertions of new repetitious sequence.

                 grimshawi   melanogaster   Difference
GC Content       41.76       33.74          8.02
Total Repeats    60.93       43.08          17.85
SINE             0           0              0
LINE             25.12       1.96           23.16
LTR              11.67       14.35          -2.68
DNA              5.73        23.72          -17.99
Unclassified     5.6         0.63           4.97
Satellites       12.8        0              12.8

Table 7: Comparison of repeats in melanogaster and grimshawi. All values are percentages of the sequence; the Difference column is grimshawi minus melanogaster.

6 Conclusion

The final putative annotations for this contig are given in Figure 2 and Tables 3, 5, and 7. Two Genscan features were confirmed as orthologs to genes in melanogaster and two were rejected as unmasked repetitious sequence. Synteny in the two identified genes was confirmed with respect to the related region of the D. melanogaster dot chromosome, although there were significant variations in gene lengths, distances of gene separation, repeat content, and repeat type. Deviations from the melanogaster gene structure were generally minor, except for the missing ortholog to exon 5_941 in the CG2177 ortholog, but since exon 5_941 is the same as exon 4_942 with a few extra bases towards the beginning, it is easy to imagine how one species might be able to tolerate missing this exon.

Further work should be done on adjacent contigs to check for the Syt7 and Rad23 genes (which should be nearby if synteny is strongly preserved) as well as to see if the difference in repeat composition between melanogaster and grimshawi is similar throughout the genomes. Finally, while it is possible that there are still unannotated genes in this contig that were missed in this search, the repeat density on this contig and the absence of other genes that were checked for in the study make that hypothesis unlikely.

7 Appendix

7.1 CG2177 Ortholog - B isoform

7.1.1 .fasta file

>CG2177_transcript
ATGATGCTTGTCGATCAAGTGTCGCAACGCCAGACAACGGGTAGCGAAAACGATAAAAAT
ATTACGGCAACACTGGGTCTGGTCGTGCACGCGGCAGCCGATGGAGTCGCTTTGGGCGCT
GCTGCCACCACCAGTCACCAGGATGTGGAAATTATTGTTTTCCTTGCCATAATGTTGCAC
AAGGCGCCGGCCGCATTTGGTTTGGTCAGCTTTCTTCTGCACGAGAAAGTGGAGAGGCAA
CAGATACGCCGACATTTGGGCGTATTTTCGCTGTCGGCGCCATTGCTGACCCTGCTCACA
TATTTTGGCATTGGACAGGAGCAGAAGGAAACGTTGAATTCGGTGAACGCCACTGGGATT
GCCATGCTTTTTTCGGCGGGTACTTTTTTATATGTGGCAACGGTGCATGTGTTGCCCGAG
TTAACGCAGGCACATCAGCACAGTGGAATGCATCACAAGAATGGCACTGGTTCCGGTTCC
AGCACGTATGAGTATCATGCGCTGGAGGAATCACGCAGCGAAGCGGGGATTGACTCTGCT
GGGAGTGTTCAGGTTCACAGCAGCAGCAAACCAGGGCTGCTCTATGGTGAACTCATCATT
ATGATCTGTGGTGCTTTGCTGCCCCTGGTCATCACCTTTGGGCATCATCAT

7.1.2 .gff file

track name="custommodel" description="custom Gene Model" color=200,0,0 visibility=2
contig12 GEP CDS 30468 30564 . + 0 gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP CDS 30632 31185 . + 2 gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP stop_codon 31186 31188 . + . gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP exon 30468 30564 . + . gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP exon 30632 31185 . + . gene_id "CG2177"; transcript_id "CG2177";

7.1.3 .pep file

>CG2177_peptide
MMLVDQVSQRQTTGSENDKNITATLGLVVHAAADGVALGAAATTSHQDVEIIVFLAIMLH
KAPAAFGLVSFLLHEKVERQQIRRHLGVFSLSAPLLTLLTYFGIGQEQKETLNSVNATGI
AMLFSAGTFLYVATVHVLPELTQAHQHSGMHHKNGTGSGSSTYEYHALEESRSEAGIDSA
GSVQVHSSSKPGLLYGELIIMICGALLPLVITFGHHH

7.2 CG2177 Ortholog - C Isoform

7.2.1 .fasta file

>CG2177-RC_transcript
ATGGCCGAGGAGACTATAATACTAATATTGTTGGTAATTGTGATGCTGGTTGGCTCATAT
TTAGCTGGCAGTATACCGCTGGTCATGAAACTGAGCGAGGAGAAACTAAAATGTGTGACC
GTATTGGGTGCAGGTCTGCTGGTGGGCACAGCG TAACTGTCATTATACCCGAGGGCATA
AGATCTCTTTATATGGATAGCAGACGACAGCAGTTGCCACAAGCAGCGGATGCAAGCACA
ACGGGCATTTTGGTGGCGTCACCGCAAATGGACTATTCGAGAACAATTGGCTTGTCGCTT
GTATTGGGCTTTGTTT

GATGCTTGTCGATCAAGTGTCGCAACGCCAGACAACGGGT
AGCGAAAACGATAAAAATATTACGGCAACACTGGGTCTGGTCGTGCACGCGGCAGCCGAT
GGAGTCGCTTTGGGCGCTGCTGCCACCACCAGTCACCAGGATGTGGAAATTATTGTTTTC
CTTGCCATAATGTTGCACAAGGCGCCGGCCGCATTTGGTTTGGTCAGCTTTCTTCTGCAC
GAGAAAGTGGAGAGGCAACAGATACGCCGACATTTGGGCGTATTTTCGCTGTCGGCGCCA
TTGCTGACCCTGCTCACATATTTTGGCATTGGACAGGAGCAGAAGGAAACGTTGAATTCG
GTGAACGCCACTGGGATTGCCATGCTTTTTTCGGCGGGTACTTTTTTATATGTGGCAACG
GTGCATGTGTTGCCCGAGTTAACGCAGGCACATCAGCACAGTGGAATGCATCACAAGAAT
GGCACTGGTTCCGGTTCCAGCACGTATGAGTATCATGCGCTGGAGGAATCACGCAGCGAA
GCGGGGATTGACTCTGCTGGGAGTGTTCAGGTTCACAGCAGCAGCAAACC GCTC
TATGGTGAACTCATCATTATGATCTGTGGTGCTTTGCTGCCCCTGGTCATCACCTTTGGG
CATCATCAT

7.2.2 .gff file

track name="custommodel" description="custom Gene Model" color=200,0,0 visibility=2
contig12 GEP CDS 30081 30179 . + 0 gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP CDS 30249 30564 . + 0 gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP CDS 30632 31185 . + 2 gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP stop_codon 31186 31188 . + . gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP exon 30081 30179 . + . gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP exon 30249 30564 . + . gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP exon 30632 31185 . + . gene_id "CG2177-RC"; transcript_id "CG2177-RC";

7.2.3 .pep file

>CG2177-RC_peptide
MAEETIILILLVIVMLVGSYLAGSIPLVMKLSEEKLKCVTVLGAGLLVGTALTVIIPEGI
RSLYMDSRRQQLPQAADASTTGILVASPQMDYSRTIGLSLVLGFVFMMLVDQVSQRQTTG
SENDKNITATLGLVVHAAADGVALGAAATTSHQDVEIIVFLAIMLHKAPAAFGLVSFLLH
EKVERQQIRRHLGVFSLSAPLLTLLTYFGIGQEQKETLNSVNATGIAMLFSAGTFLYVAT
VHVLPELTQAHQHSGMHHKNGTGSGSSTYEYHALEESRSEAGIDSAGSVQVHSSSKPGLL
YGELIIMICGALLPLVITFGHHH

7.3 CG32850 Ortholog

7.3.1 .fasta file

>CG32850-PA_transcript
ATGGGTAATTGCTTGAAAATGAGCAGTCCAGATGACATTTCACTTTTGCGAGGCAGCGAT
AGCATCATTAGTGCACAGGACAATGGACCAATGCCAATTTATCAGCAGGAGCCGATGCCA
CAGCTGTTCTATCAAACGGTCAGTGGCAATACATCTGGCAACGCTGTCGCCGCTGCCACT
CACATGTCCGAAGAGGATCAGATAAAAATAGCAAAGCGCATTGGATTAGTTCAACATTTG
CCGATTGGCACGTATGACAGCAACTCAAAGAAAGCAGCACGCGAATGCGTCATTTGTATG
GTGGAATTTAGCAACGAGGAAGCCGTTCGCTATTTGCCCTGCATGCACATTTATCATGTG
AACTGCATCGACGATTGGCTAATGCGTAGTTTAACCTGCCCCAGTTGCTTGGAACCGGTG
GATGCGGCTCTACTCACTAGCTATGAGACAACA

7.3.2 .gff file

track name="custommodel" description="custom Gene Model" color=200,0,0 visibility=2
contig12 GEP CDS 28998 29102 . - 0 gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP CDS 28598 28945 . - 0 gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP stop_codon 28595 28597 . - . gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP exon 28998 29102 . - . gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP exon 28598 28945 . - . gene_id "CG32850-PA"; transcript_id "CG32850-PA";

7.3.3 .pep file

>CG32850-PA_peptide
MGNCLKMSSPDDISLLRGSDSIISAQDNGPMPIYQQEPMPQLFYQTVSGNTSGNAVAAAT
HMSEEDQIKIAKRIGLVQHLPIGTYDSNSKKAARECVICMVEFSNEEAVRYLPCMHIYHV
NCIDDWLMRSLTCPSCLEPVDAALLTSYETT
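As a consistency check on the gene models above, the phase column of each CDS record can be recomputed from the CDS lengths. The sketch below is a hedged illustration, not the Gene Model Checker: the function name is hypothetical, and it assumes the usual GFF convention that phase is the number of bases to skip at the start of a CDS segment to reach the next complete codon, with segments read in transcript order (descending coordinates on the minus strand).

```python
# Minimal sketch: recompute CDS phases from segment lengths and confirm the
# total coding length is a whole number of codons.
# Coordinates are copied from the appendix GFF records (1-based, inclusive).

def expected_phases(cds_segments, strand):
    """Return (list of phases in transcript order, total coding length)."""
    ordered = sorted(cds_segments, key=lambda seg: seg[0],
                     reverse=(strand == "-"))
    phases, consumed = [], 0
    for low, high in ordered:
        phases.append((3 - consumed % 3) % 3)
        consumed += high - low + 1
    return phases, consumed


MODELS = {
    "CG2177": ("+", [(30468, 30564), (30632, 31185)]),
    "CG2177-RC": ("+", [(30081, 30179), (30249, 30564), (30632, 31185)]),
    "CG32850-PA": ("-", [(28998, 29102), (28598, 28945)]),
}

results = {name: expected_phases(cds, strand)
           for name, (strand, cds) in MODELS.items()}

for name, (phases, total) in results.items():
    print(name, phases, total, "in frame" if total % 3 == 0 else "NOT in frame")
```

The recomputed phases agree with the phase column in all three models, and each total coding length (651, 969, and 453 bases) is divisible by three.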