Annotation of Drosophila grimshawi Contig12


Marshall Strother
April 27, 2009

Contents

1 Overview
2 Genes
  2.1 Genscan Feature 12.4
    2.1.1 Genome Browser: First Look
    2.1.2 BLASTP
    2.1.3 Flybase Gene Record Finder
    2.1.4 TBLASTN
    2.1.5 Genome Browser for Intron/Exon Boundaries
    2.1.6 Gene Checker
  2.2 Genscan Feature 12.3
    2.2.1 Genome Browser: First Look
    2.2.2 BLASTP
    2.2.3 Flybase Gene Record Finder
    2.2.4 TBLASTN
    2.2.5 Genome Browser
    2.2.6 Gene Checker
  2.3 Genscan Feature 12.2
    2.3.1 Genome Browser: First Look
    2.3.2 BLASTP
    2.3.3 BLASTX
    2.3.4 BLAT
    2.3.5 BLASTX: Predicted Exon Only
    2.3.6 Genome Browser
    2.3.7 BLASTP
    2.3.8 CLUSTALW: Predicted Exon Only
    2.3.9 Conclusion
  2.4 Genscan Feature 12.1
    2.4.1 Genome Browser: First Look
    2.4.2 BLASTP
    2.4.3 BLASTX
    2.4.4 BLAT
3 Repeats
4 CLUSTALW
  4.1 CG2177-PC
    4.1.1 First Alignment
    4.1.2 Alignment without S. cerevisiae
  4.2 CG32850-PA
5 Synteny
6 Conclusion
7 Appendix
  7.1 CG2177 Ortholog - B isoform
    7.1.1 fasta file
    7.1.2 gff file
    7.1.3 pep file
  7.2 CG2177 Ortholog - C Isoform
    7.2.1 fasta file
    7.2.2 gff file
    7.2.3 pep file
  7.3 CG32850 Ortholog
    7.3.1 fasta file
    7.3.2 gff file
    7.3.3 pep file

List of Figures

1 Initial genome browser view
2 Map of final annotation
3 BLASTP of Genscan feature 12.4 against D. melanogaster annotated protein database
4 CG2177 isoforms in melanogaster
5 The exons and isoforms of CG2177 mapped onto the D. grimshawi contig. As discussed in Section 2.1.4, the first 13 amino acids of exon 5_941 (represented here by a red bar) do not align between D. grimshawi and D. melanogaster. Coupled with data from the Genome Browser (Section 2.1.5), this suggests that the A isoform has no ortholog in D. grimshawi
6 The beginning of the annotated CG2177 ortholog. The start codon in frame 3 is the true start to the gene (orthologous to exon 4_942). The two stop codons in the same frame and the lack of a start codon rule out the presence of an exon orthologous to 5_941
7 BLASTP results for GENSCAN feature 12.3
8 Possible splice acceptors for exon orthologous to 3_959. The boundary that was ultimately annotated is represented by the second red box from the left
9 Text results of BLASTX search of section of contig12 relevant to Genscan feature 12.2 to NCBI NR Database
10 BLAT results of region of contig 12 relevant to Genscan feature 12.2
11 Example of BLAT browser view for one of the results shown in Figure 10
12 Results of the BLASTX search for the exonic sequence of Genscan Feature 12.2
13 Genome browser view showing the region around Genscan feature 12.2
14 BLASTP against the NCBI NR database of the amino acid sequence of the predicted exon near Genscan feature 12-2 in frame -2
15 CLUSTALW alignment of proposed protein sequence related to Genscan feature 12-2 and protein sequences from top five BLASTX hits (see Section 2.3.5 and Figure 12)
16 BLAT results of region of contig 12 relevant to Genscan feature 12.1
17 Example browser views for one of the results shown in Figure 16
18 Summary and detailed RepeatMasker output
19 CLUSTALW alignment of the grimshawi CG2177-PC protein ortholog sequence with the putative orthologous sequences from D. melanogaster, H. sapiens, C. elegans, and S. cerevisiae
20 Repeat of the alignment shown in Figure 19 with the S. cerevisiae sequence removed
21 CLUSTAL alignment for the CG32850-PA protein using the orthologs from mojavensis, virilis, pseudoobscura, melanogaster, and grimshawi
22 Annotated melanogaster genes between bases 306,470 and 340,172 of the dot chromosome
23 Summary table of RepeatMasker output from run on D. melanogaster

List of Tables

1 Summary of annotations. Feature regions extend from the predicted promoter to the predicted polyadenylation site. Figure 2 shows only the exons of the predicted features

2 TBLASTN results for CG2177
3 Gene model corresponding to Genscan feature 12.4. Note that 5_941 has no ortholog in grimshawi
4 TBLASTN results for CG32850
5 Model for CG32850 ortholog
6 Synteny comparisons of melanogaster and grimshawi chunks. Lengths of genes are given from start codon to stop codon. Distances between genes are given from start codon to start codon
7 Comparison of repeats in melanogaster and grimshawi

Figure 1: Initial genome browser view

Feature         Region              Putative Annotation
Genscan 12.1    12,593 to 15,160    Unmasked repeat
Genscan 12.2    25,554 to 15,455    Unknown coding gene
Genscan 12.3    29,314 to 25,609    CG32850 ortholog
Genscan 12.4    33,312 to 29,479    CG2177 ortholog

Table 1: Summary of annotations. Feature regions extend from the predicted promoter to the predicted polyadenylation site. Figure 2 shows only the exons of the predicted features.

1 Overview

In this paper, I present the annotation of the newly finished contig12 of the Drosophila grimshawi dot chromosome. This annotation was done chiefly by using comparative genomic methods to confirm or refute predictions made by the de novo gene predictor Genscan. Synteny comparisons with D. melanogaster were also taken into account and are discussed. Genscan's predictions generally coincided with those of other de novo gene predictors (see Figure 1). The final annotation (shown in Figure 2) includes two genes, CG32850 and CG2177, which correspond to Genscan features 12.3 and 12.4 and can be found in the contig at bases 29,314-25,609 and 33,312-29,479 respectively. Multi-sequence alignment of these genes to orthologous genes in other species was performed using CLUSTALW to gain insight into their evolutionary history. The other two Genscan-predicted features were annotated as repetitious sequence left unmasked by preprocessing with RepeatMasker. These features contribute to the overall high repeat density of the contig (60.93% not including the features).

2 Genes

2.1 Genscan Feature 12.4

My first significant annotation began with an examination of Genscan feature 12.4. The procedure described below is quite typical of a straightforward annotation of a highly probable ortholog.

Figure 2: Map of final annotation

Figure 3: BLASTP of Genscan feature 12.4 against D. melanogaster annotated protein database

2.1.1 Genome Browser: First Look

I began by looking at the genome browser to get a general impression of the feature and what to expect. As shown in Figure 1, the initial genome browser view, feature 12.4 aligns quite strongly with predictions from all of the other gene predictors as well as the BLASTX alignment to D. melanogaster proteins and several other sources of Drosophila genomic data.

2.1.2 BLASTP

I then obtained the sequence of the predicted Genscan feature from the provided output and used it as the query in a BLASTP search against the D. melanogaster annotated protein database. The best-matching gene from D. melanogaster would be examined in subsequent steps for homology. The results for feature 12.4 (Figure 3) showed strong similarity to the D. melanogaster CG2177 protein.

2.1.3 Flybase Gene Record Finder

I began confirming homology by looking up the gene in Flybase and obtaining the sequences of all of the gene's exons from the CDS Translations sections. In the case of CG2177, there are several isoforms, all of which are combinations of the exons CDS_FBgn :1_941, CDS_FBgn :2_941, CDS_FBgn :4_942, CDS_FBgn :5_941, and CDS_FBgn :2_942. From here on, these exons will be referred to with everything before the colon truncated for simplicity. All three of these isoforms (CG2177-RA [1], CG2177-RB, and CG2177-RC) are confirmed in melanogaster according to the record for this gene in Flybase. The Gene Record Finder also contains links to the Flybase page discussing the gene's function. The CG2177 protein has activity as a metal-ion trans-membrane transporter, which is not confirmed in vivo but is predicted from InterPro electronic annotation. It has only one known allele.

Figure 4: CG2177 isoforms in melanogaster

Exon   Length (amino acids)   Aligned Region in Exon   Aligned Region in contig   Frame
1_ to ,634-31,
_ to 95 30,249-30,
_ to 33 30,081-30,
_ to 35 30,081-30,
_ to 32 30,468-30,563 3

Table 2: TBLASTN results for CG2177

2.1.4 TBLASTN

I then used TBLASTN to systematically align each exon with the entire masked contig. Each search typically returned exactly one plausible alignment. I noted the ends of the alignment on the contig as putative exon boundaries, as well as the frame for the alignment (see Table 2). If a significant section of the query was missing on either end of the alignment (e.g., the first or last 10 amino acids), this was also noted. The entire length of the query (exon) aligned to the subject (nucleotides) for all exons except 5_941. In 5_941, the first 12 of the 45 amino acids do not align. Interestingly, exons 5_941 and 4_942 are exactly the same except for the first 12 amino acids of exon 5_941. Increasing the expect value threshold by two orders of magnitude fails to give any alignments that include the first 12 amino acids of exon 5_941 in the same frame. This makes it very likely that exon 5_941 in D. melanogaster does not have an ortholog in D. grimshawi.
This will almost certainly be confirmed if there is no start codon approximately 12 amino acids before the putative ortholog to 4_942, as shown in the genome browser.

2.1.5 Genome Browser for Intron/Exon Boundaries

Using the putative exons identified in the TBLASTN search as a guide, I returned to the genome browser to locate the exact sites of the gene's start codon, stop codon, and intron/exon boundaries. At the same time, I could also observe how well this gene model corresponded to the original gene model predicted by Genscan. An example of the views used to identify exon boundaries in the CG2177 gene is shown in Figure 6. Note that this includes the beginning of exon 5_941, which required some special consideration because the first 12 amino acids did not align with D. melanogaster. However, there are no other possible start codons anywhere within a reasonable distance of the alignment. There are also several in-frame stop codons. Missing exons between species are quite rare; however, the absence of a start codon and the presence of in-frame stop codons are very strong evidence against the presence of exon 5_941 in D. grimshawi, so I will not include an ortholog to 5_941 in my final annotation. The final putative gene model for CG2177 in D. grimshawi is shown in Table 3. This corresponds very well with the Genscan prediction as well as the TBLASTN alignments.

[1] Note: RA, RB, RC, etc. refer to the processed mRNA transcripts of each of the isoforms of a gene, whereas PA, PB, and PC refer to the resulting translated proteins. When the distinction is relevant, an effort has been made to use the appropriate label; however, R vs. P labeling is often used interchangeably.

Figure 5: The exons and isoforms of CG2177 mapped onto the D. grimshawi contig. As discussed in Section 2.1.4, the first 13 amino acids of exon 5_941 (represented here by a red bar) do not align between D. grimshawi and D. melanogaster. Coupled with data from the Genome Browser (Section 2.1.5), this suggests that the A isoform has no ortholog in D. grimshawi.

Figure 6: The beginning of the annotated CG2177 ortholog. The start codon in frame 3 is the true start to the gene (orthologous to exon 4_942). The two stop codons in the same frame and the lack of a start codon rule out the presence of an exon orthologous to 5_941.

Dmel CG2177 Ortholog   Frame   First bp after splice acceptor   Start Codon   Last bp before splice donor   Stop codon
1_ ,632 X X 31,186-31,188
2_ ,249 X 30,564 X
4_942 3 X 30,081-30,083 30,179 X
5_941 3 X X X X
2_942 3 X 30,468-30,563 30,564 X

Table 3: Gene model corresponding to Genscan feature 12.4. Note that 5_941 has no ortholog in grimshawi.
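A gene model such as the one in Table 3 can be sanity-checked by splicing its exons together and testing a few basic rules. The sketch below is only an illustration under stated assumptions: it is not the GEP Gene Checker, the function names `splice_exons` and `check_cds` are hypothetical, exons are assumed to lie on the plus strand with 1-based inclusive coordinates, and the toy contig and coordinates are invented rather than taken from contig12.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def splice_exons(contig, exons):
    """Join plus-strand exons given as 1-based inclusive (start, end) pairs."""
    return "".join(contig[start - 1:end] for start, end in sorted(exons))

def check_cds(cds):
    """Return a list of rule violations for a spliced coding sequence."""
    problems = []
    if len(cds) % 3 != 0:
        problems.append("length is not a multiple of 3")
    if not cds.startswith("ATG"):
        problems.append("no ATG start codon")
    # Split into complete codons, ignoring any trailing partial codon.
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    if not codons or codons[-1] not in STOP_CODONS:
        problems.append("no terminal stop codon")
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        problems.append("internal in-frame stop codon")
    return problems

# Toy contig (invented): exon 1 = bases 3-8, intron = bases 9-14, exon 2 = bases 15-20.
toy_contig = "CC" + "ATGGCC" + "GTAAAG" + "AAATGA" + "CC"
toy_exons = [(3, 8), (15, 20)]

cds = splice_exons(toy_contig, toy_exons)
print(cds, check_cds(cds))
```

An empty list means the toy model passes all three rules. A real checker would also verify splice donor and acceptor sites and handle minus-strand genes, which this sketch leaves out.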

Figure 7: BLASTP results for GENSCAN feature 12.3

2.1.6 Gene Checker

Finally, having come up with boundaries for each exon, I constructed gene models for each isoform of the protein and ran each through the gene checker (one check for each isoform), which checks that the gene model obeys basic biological rules for plausible genes. Since exon 5_941 is predicted to be missing in D. grimshawi and is necessary for the A isoform, I only ran the gene model checker for the isoforms predicted to be orthologous to the B and C isoforms. Both models passed.

2.2 Genscan Feature 12.3

2.2.1 Genome Browser: First Look

As shown in Figure 1 (the initial genome browser view), feature 12.3, like feature 12.4, aligns quite strongly with predictions from all of the other gene predictors as well as the BLASTX alignment to D. melanogaster proteins and several other sources of Drosophila genomic data.

2.2.2 BLASTP

Running the same BLASTP search as described in Section 2.1.2 gives the results shown in Figure 7. The best match is to the D. melanogaster CG32850 gene (PA isoform) by over 100 points and over 46 orders of magnitude. Matches to other genes also have e-values of much less than 10^-4, so these warrant investigation, but they all align in a region between amino acids 90 and 130 in the query, so these matches may indicate a protein functional motif that is present in many different genes rather than a gene-gene orthology relationship. Looking up the function of the CG32850 gene in Entrez reveals that it has both protein-binding and zinc-ion-binding functions. No part of the gene is annotated as a functional motif, but a cursory examination of the Entrez gene records for several of the weaker matches in the BLASTP search shows that all genes sampled also have protein-binding and zinc-ion-binding functions, which supports the functional motif hypothesis.
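The distinction drawn above between a gene-level match and a shared motif can be checked mechanically: if every weaker hit covers only the same short stretch of the query, a shared domain is the more economical explanation. The following sketch assumes the hit coordinates have already been read from the BLASTP report. The hit names, intervals, and query length are hypothetical placeholders, not values from Figure 7, and `confined_to_window` and `query_coverage` are illustrative helpers.

```python
def confined_to_window(hits, window_start, window_end):
    """True if every hit's query interval lies inside [window_start, window_end]."""
    return all(window_start <= q_start and q_end <= window_end
               for _, q_start, q_end in hits)

def query_coverage(q_start, q_end, query_length):
    """Fraction of the query covered by one hit (1-based inclusive coordinates)."""
    return (q_end - q_start + 1) / query_length

# Hypothetical weaker hits: (name, query start, query end) in amino acids.
weaker_hits = [("hit_A", 92, 128), ("hit_B", 95, 130), ("hit_C", 90, 121)]
query_length = 150  # assumed length, for illustration only

print(confined_to_window(weaker_hits, 90, 130))
for name, start, end in weaker_hits:
    print(name, round(query_coverage(start, end, query_length), 2))
```

Low coverage combined with confinement to one window is what a shared motif would look like; an orthologous gene would normally cover most of the query.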

Exon   Length (amino acids)   Aligned Region in Exon   Aligned Region in Contig   Frame
2_ to 31 29,102 to 28,
_ to ,924 to 28,595 -3

Table 4: TBLASTN results for CG32850

Figure 8: Possible splice acceptors for exon orthologous to 3_959. The boundary that was ultimately annotated is represented by the second red box from the left.

2.2.3 Flybase Gene Record Finder

According to the Gene Record Finder, CG32850 has two coding exons and a single isoform. The exons are called CDS_CG32850:2_959 and CDS_CG32850:3_959 (hereafter referred to as exons 2_959 and 3_959 respectively). They appear in the fully translated gene in order of their numbering. All exons are on the + strand of melanogaster chromosome 4. There is no additional information in Flybase about the function of the gene, its translated protein, or its number of observed alleles.

2.2.4 TBLASTN

The results of TBLASTN searches similar to those performed in Section 2.1.4 are shown in Table 4.

2.2.5 Genome Browser

Exon 2_959. A putative exon orthologous to exon 2_959 was found in the genome browser that perfectly matched the prediction from the TBLASTN alignment. In frame -2 there is a start codon at 29,102-29,100 and a GT intron donor site at 28,997-28,996 (phase 0).

Exon 3_959. There is a stop codon in frame -3 at 28,597-28,595, which matches the TBLASTN prediction for exon 3_959 perfectly. Identification of a putative splice acceptor site was slightly more difficult. There were five possible sites that were all in phase 0 relative to frame -3. These sites corresponded to the first base of the exon being at 28,927, 28,939, 28,942, 28,945, and 28,966 (see Figure 8). The first of these puts the beginning of the exon closest to the beginning of the alignment to melanogaster exon 3_959. However, the alignment did not include 8 amino acids from the query, which led me to believe that the splice site was actually at a higher base number. There is a stop codon in frame -3 beginning at base 28,948, so the splice site could not be at any base number higher than that. These two lines of evidence led me to choose the splice site composed of bases 28,947 and 28,946. Further evidence for this splice site is supplied by the genome browser (which labels it as a high-likelihood acceptor) and by the fact that this splice site is only 3 bases downstream from where we would expect to find it if all 8 amino acids missing in the alignment were included in the exon.

Dmel CG32850 Ortholog   Frame   First bp after splice acceptor   Start Codon      Last bp before splice donor   Stop codon
2_959                   -2      X                                29,102-29,100    28,998                        X
3_959                   -3      28,945                           X                X                             28,597-28,595

Table 5: Model for CG32850 ortholog.

2.2.6 Gene Checker

The final model for the grimshawi putative ortholog to CG32850 is shown in Table 5. It passes the gene checker.

2.3 Genscan Feature 12.2

2.3.1 Genome Browser: First Look

Genscan feature 12-2 is a predicted single-exon gene in the middle of two long stretches of repetitious sequence. Single-exon genes are not common, so this piece of evidence alone suggests that it may be a mis-prediction by Genscan. Its location in the middle of repetitious sequence may suggest that it is actually repetitious sequence missed by RepeatMasker. Finally, Genscan is the only gene predictor that predicted a gene in this region, which is evidence in support of this feature being a miscall by Genscan.

2.3.2 BLASTP

A BLASTP search, with an e-value threshold of 1, of the Genscan-predicted amino acid sequence for feature 12-2 reveals no significant matches to D. melanogaster or any of the other species available in FlyBase. (This was done in two searches: one against only D. melanogaster, the other against all species.)
A search against all species with the low-complexity filter turned off returns the same results.

2.3.3 BLASTX

To collect further evidence that this is not a significant feature, I extracted the entire region of the contig from the beginning to the end of Genscan feature 12-2 (bases 15455 to 25554) using the EMBOSS tool extractseq on the GEP server. I then ran a BLASTX search against the NCBI NR database using this extracted region as the query. This search returns a large number of extremely significant alignments (ranging from 1e-162 to 1e-30 in the first 30 hits) to reverse transcriptase genes in many different Drosophila species (see Figure 9). It also has matches of similar strength to predicted RefSeq proteins, but coupled with this feature's proximity to large repetitious elements as seen in the genome browser (see Figure 1), it is highly likely that this feature is the result of repetitious sequence that RepeatMasker failed to mask.

2.3.4 BLAT

To further confirm that this feature is due to repetitious sequence rather than a true gene, I did a BLAT search of the extracted region against the entire grimshawi genome on the UCSC server. The results of this search are shown in Figure 10. As we would expect for repetitious sequence, the region aligns to many different places in the genome with a high level of similarity along the entire length of the query. Theoretically, this could still represent a large gene family with many different members throughout the genome. To rule out this possibility, I opened the genome browser view for a sampling of the top alignments in the BLAT search. An example of one of these views is shown in Figure 11. There were no alignments to any melanogaster proteins in any of these views, which provides further evidence that this feature is repetitious non-protein-coding sequence.
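The region-extraction step used for these searches can be illustrated without the EMBOSS tools. The sketch below is a stand-in written for this example, not extractseq itself: `extract_region` and `reverse_complement` are hypothetical helpers, coordinates are assumed to be 1-based and inclusive, and the toy sequence is invented. Treating a region whose start is greater than its end as lying on the minus strand is a convention chosen for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def extract_region(seq, start, end):
    """Extract a 1-based inclusive region; start > end means minus strand."""
    if start <= end:
        return seq[start - 1:end]
    return reverse_complement(seq[end - 1:start])

toy = "AACCGTTTAC"
print(extract_region(toy, 3, 6))  # bases 3..6 read on the plus strand
print(extract_region(toy, 6, 3))  # the same bases read on the minus strand
```

Extracting the whole promoter-to-polyA span and extracting only the predicted exon differ only in the coordinates passed in, which makes it easy to rerun a search on a narrower region.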

Figure 9: Text results of BLASTX search of section of contig12 relevant to Genscan feature 12.2 to NCBI NR Database

Figure 10: BLAT results of region of contig 12 relevant to Genscan feature 12.2

Figure 11: Example of BLAT browser view for one of the results shown in Figure 10

Figure 12: Results of the BLASTX search for the exonic sequence of Genscan Feature 12.2

2.3.5 BLASTX: Predicted Exon Only

Genscan predicts that this feature has only one exon, which occurs between bases 25,321 and 25,073 inclusive. The polyadenylation signal for this gene, however, begins at base 15,455. There is a significant amount of masked repetitious sequence between base 25,073 and base 15,455. Since I included every base between the beginning of the promoter and the end of the polyA signal in the extracted region that I used for the above BLASTX and BLAT searches, the results of these searches could be significantly confounded by the repetitious sequence. I therefore repeated the BLASTX search using only the predicted exonic region (base 25,321 to base 25,073). The results of the BLASTX search are shown in Figure 12. There are four fairly strong (e ) alignments to predicted proteins that extend between base 60 and base 235 of the query and three more alignments to predicted proteins in around the same region that are slightly less strong (e ). Looking at the top six alignments from the BLASTX search, it is clear that there is a stretch of 29 amino acids (starting around translated base 149 and ending around base 62) that is particularly well conserved and is almost identical in the query and all of the predicted proteins. This conservation may be indicative of function, which would suggest that this sequence is actually transcribed and translated. (This is further discussed in Section 2.3.8.) It is interesting to note, however, that all of the alignments occur in frame -2 of the contig (which corresponds to frame -3 of the BLASTX query), while the exon of Genscan feature 12-2 is predicted to occur in frame -3 of the contig.
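Because the BLASTX alignments and the Genscan prediction fall in different reverse frames, it can help to translate all three reverse frames of the extracted region and compare them side by side. The sketch below does this for an invented toy sequence using the standard genetic code. Frames are labeled only by their offset (0, 1, 2) from the 3' end of the extracted region; mapping these offsets onto the browser's -1/-2/-3 labels depends on the contig coordinates and is deliberately not attempted here. The function names are hypothetical.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, built from the usual TCAG ordering of the three codon positions.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

def translate(seq):
    """Translate complete codons; unknown codons become 'X', stops become '*'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))

def reverse_frames(seq):
    """Translate the three reverse-strand frames, keyed by offset 0, 1, 2."""
    rc = reverse_complement(seq.upper())
    return {offset: translate(rc[offset:]) for offset in range(3)}

toy_region = "TTAGGCCATAA"  # invented sequence, not from contig12
for offset, protein in reverse_frames(toy_region).items():
    print(offset, protein)
```

In the toy example only one offset yields a start codon followed by a stop, which mirrors the kind of side-by-side comparison made in the genome browser for frames -2 and -3.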

Figure 13: Genome browser view showing the region around Genscan feature 12.2

Figure 14: BLASTP against the NCBI NR database of the amino acid sequence of the predicted exon near Genscan feature 12-2 in frame -2

2.3.6 Genome Browser

A view of the Genome Browser around Genscan feature 12-2 is shown in Figure 13. There are start codons very close together in both frame -2 and frame -3 (the frames of the BLASTX alignments and the Genscan prediction respectively). The first stop codon after the start codon in frame -2 can also be seen and is reasonably close to the stop codon associated with the Genscan prediction. From the BLASTP and BLASTX searches, it is unlikely that the exon predicted by Genscan is actually translated. However, it is possible that there is a real exon in frame -2. Under this hypothesis, the presence of start and stop codons in frame -2 is consistent with the results of the BLASTX alignment discussed in Section 2.3.5.

2.3.7 BLASTP

To confirm this new hypothesis, that the real translated exon extends from base 25,328 to base 25,135 of the contig in frame -2, I used the EMBOSS toolset to extract the region and translate it in the appropriate frame. I then used the translated sequence as the query in a BLASTP search against the NCBI NR database. The results, shown in Figure 14, look very similar to the results of the exon-only BLASTX alignments.

2.3.8 CLUSTALW: Predicted Exon Only

As discussed in Section 2.3.5, one region in particular was similar between the query sequence and all of the aligned proteins in the BLASTX search described in the same section. To get a more precise illustration of this similarity, I obtained the protein sequences from the top 5 hits and aligned them using CLUSTALW with the predicted protein sequence of Genscan feature 12-2. (I excluded the sixth-best BLASTX hit because it came from the same species as the fifth and represented an extremely similar predicted protein.)
The results of the alignment, shown in Figure 15, confirm that one end of the protein is significantly more conserved than the other.

2.3.9 Conclusion

The evidence of differential cross-species conservation revealed by the CLUSTALW and BLASTX alignments discussed above is enough for me to conclude that the region between base 25,328 and base 25,135 is translated in frame -2. I now believe the hypothesis that this feature is actually unmasked repetitious sequence to be incorrect. The evidence that supports such a hypothesis can be largely explained by a dearth of confirmed similar proteins and the inclusion of a large amount of repetitious sequence in my earlier searches.

Figure 15: CLUSTALW alignment of proposed protein sequence related to Genscan feature 12-2 and protein sequences from top five BLASTX hits (see Section 2.3.5 and Figure 12).

Since there were no alignments to known proteins or domains in any of the searches, I am unable to make any prediction about this protein's function at this time. Since I could not identify an ortholog in D. melanogaster, I am unable to run the predicted gene through the Gene Model Checker.

2.4 Genscan Feature 12.1

2.4.1 Genome Browser: First Look

Genscan feature 12.1 is in the same region as predictions from several other gene predictors. It also aligns with a stretch of the X chromosome from D. melanogaster, so if it does turn out to be a true orthologous gene, it may represent an event where a gene has been translocated between two chromosomes. Finally, like feature 12.2, it is found in the middle of repetitious elements, so it may turn out to be another unmasked repetitious sequence.

2.4.2 BLASTP

The same BLASTP searches as performed for feature 12.2 (Section 2.3.2) returned no significant hits.

2.4.3 BLASTX

I then performed the same BLASTX searches as I did for feature 12.2 (Section 2.3.3) using the extracted region built from the Genscan prediction (bases 12,593 to 15,160). This time there were many very strong hits to predicted genes, the vast majority of which cluster in the first 1200 bases of the extracted sequence, but no hits to the reverse transcriptase genes seen in feature 12.2. There were no hits to any confirmed genes. Unlike Genscan feature 12.2, the beginning of the promoter of Genscan feature 12.1 and the end of the polyA sequence form fairly tight bounds around the predicted exonic sequences of feature 12.1, which makes this kind of BLASTX search (and the following BLAT search) more informative in this case than it was in the case of Genscan feature 12.2.

2.4.4 BLAT

I then performed the same BLAT search for this feature as for feature 12.2 (see Section 2.3.4). The results (shown in Figure 16) were similar and showed a large number of hits with a high percentage of identity to the query (89-95%).
About 20% of these hits spanned almost the entire query sequence, and the remaining 80% aligned to either the first half or the second half of the query. As before, these results are consistent with the hypothesis that feature 12.1 represents unmasked repetitious sequence. As before, it is still theoretically possible that this could represent an extremely large gene family, so I looked at the genome browser view for some of the matches. Some showed results that are extremely implausible for a real gene (e.g., alignments that do not correspond to gene predictions or comparative genomics tracks, or implausibly large and frequent introns), and others showed extremely strong alignments to RepeatMasked sequence (see Figure 17), both of which further support my hypothesis that this feature represents unmasked repeat.

3 Repeats

According to RepeatMasker, contig12 is 60.93% repeat, although, as discussed in Sections 2.3 and 2.4 above, I believe that the actual percentage is significantly higher. See Figure 18 for the full table of results.

4 CLUSTALW

4.1 CG2177-PC

4.1.1 First Alignment

Since the alignment of the D. grimshawi gene to the D. melanogaster gene was so strong (on the order of 10^-92, see Figure 3) and there were immediately available putative orthologs for many distantly related species, I decided to do a CLUSTALW alignment of the grimshawi CG2177-PC protein ortholog sequence with the putative orthologous sequences from D. melanogaster, H. sapiens, C. elegans, and S. cerevisiae. This alignment, shown in Figure 19, showed relatively little conservation of the amino acid sequence of this gene, though the region corresponding to grimshawi amino acids 135 to 253 showed significantly more conservation than the rest of the protein.
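One simple way to make "more conserved" concrete is to score each alignment column by the fraction of sequences sharing the most common residue and then average those scores over a sliding window. The sketch below assumes the sequences are already aligned to equal length (for example, rows copied from a CLUSTALW output). The three toy rows are invented, `column_identity` and `window_means` are hypothetical helper names, and counting gaps as mismatches is an assumption of this sketch.

```python
from collections import Counter

def column_identity(aligned):
    """Per-column fraction of sequences carrying the most common residue."""
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]  # gaps never count as matches
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(column))
    return scores

def window_means(scores, width):
    """Mean score of every window of the given width."""
    return [sum(scores[i:i + width]) / width
            for i in range(len(scores) - width + 1)]

# Invented toy alignment: a variable start followed by a conserved end.
toy_alignment = ["MKTAYLLE",
                 "MRSAYLLE",
                 "M-PGYLLE"]

scores = column_identity(toy_alignment)
print([round(s, 2) for s in scores])
print([round(m, 2) for m in window_means(scores, 4)])
```

Windows whose mean stays near 1.0 correspond to the strictly conserved blocks one would otherwise pick out by eye in the alignment figure.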

Figure 16: BLAT results of region of contig 12 relevant to Genscan feature 12.1

Figure 17: Example browser views for one of the results shown in Figure 16

Figure 18: Summary and detailed RepeatMasker output

Figure 19: CLUSTALW alignment of the grimshawi CG2177-PC protein ortholog sequence with the putative orthologous sequences from D. melanogaster, H. sapiens, C. elegans, and S. cerevisiae.

                                   melanogaster   grimshawi
Length of CG2177 1,
Length of CG ,
Distance between the two genes 5,

Table 6: Synteny comparisons of melanogaster and grimshawi chunks. Lengths of genes are given from start codon to stop codon. Distances between genes are given from start codon to start codon.

One explanation for such conservation is the presence of a conserved functional domain. However, a BLAST search of the protein sequence reveals that the only strong alignments to conserved functional domains are to the ZIP superfamily in the region of amino acids 51 to 156 in the grimshawi sequence. While the small overlap between the conserved sequence and the alignment to the ZIP superfamily domain may be significant, it is unlikely that this explains the extra conservation entirely. The conserved region likely serves some important function in the protein, but as of now it is impossible to tell exactly what.

4.1.2 Alignment without S. cerevisiae

Looking at the output from the first alignment, it is clear that there is significantly more similarity between the H. sapiens, C. elegans, D. melanogaster, and D. grimshawi proteins than there is between any one of those proteins and the S. cerevisiae protein. I therefore excluded S. cerevisiae and repeated the alignment. The results are shown in Figure 20. In this alignment, there is a region that is somewhat conserved between amino acid 10 and amino acid 60 of the D. grimshawi protein. There are also four regions of very strict conservation between amino acids 96 and 247 of the D. grimshawi protein. Again, very little conservation is seen in the region identified as resembling the ZIP superfamily domain, which leads me to believe that the similarity to ZIP is coincidental and not indicative of this protein's function.
It is likely that the protein's function is governed by the amino acids that are highly conserved between the species, but I am not able to generate any hypothesis about what that function might be through standard comparative genomics methods.

4.2 CG32850-PA

I did a CLUSTALW alignment for the CG32850-PA protein using the orthologs from mojavensis, virilis, pseudoobscura, melanogaster, and grimshawi. This alignment, shown in Figure 21, shows that approximately the first 65 amino acids of the protein are significantly less conserved than the rest of the protein. Doing a BLAST search of the protein sequence for the grimshawi gene reveals that amino acids align strongly to the RING superfamily of protein domains, which is a specialized zinc finger functional domain. The extra conservation in the second half of the amino acid sequence may be explained by the importance of this functional domain and the amino acids that immediately surround it.

5 Synteny

Figure 22 shows a view from Flybase of the annotated genes on the dot chromosome of D. melanogaster between bases 306,470 and 340,172. This represents a chunk of the D. melanogaster dot chromosome equal in size to grimshawi contig 12, taken so that the first base of the start codon of CG32850 and its grimshawi ortholog are both at base 29,102 of the melanogaster chunk and the grimshawi contig respectively. This view shows genes CG2177 and CG32850 (which correspond to Genscan features 12-4 and 12-3 respectively) closely adjacent to each other in the same order as in the grimshawi contig, representing synteny between the two genomes. A more detailed comparison of the distance/length features in the two genomes is given in Table 6. Overall, synteny in terms of order and orientation seems to have been preserved, but there are significant variations in gene length and in the distance between the two genes.
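The conventions used for the synteny comparison (gene length from start codon to stop codon, distance from start codon to start codon) reduce to simple coordinate arithmetic. The sketch below applies them to the grimshawi start and stop codon coordinates reported in the gene models of Tables 3 and 5. The helper name `span_length` is hypothetical, coordinates are assumed to be 1-based and inclusive, and the printed numbers follow only from those coordinates; they are not copied from the synteny table.

```python
def span_length(first_base, last_base):
    """Inclusive length between two 1-based coordinates on either strand."""
    return abs(last_base - first_base) + 1

# grimshawi coordinates as reported in the gene models:
# CG2177 (plus strand): start codon begins at 30,081; stop codon ends at 31,188.
# CG32850 (minus strand): start codon begins at 29,102; stop codon ends at 28,595.
cg2177_start, cg2177_stop_end = 30081, 31188
cg32850_start, cg32850_stop_end = 29102, 28595

print("CG2177 span:", span_length(cg2177_start, cg2177_stop_end))
print("CG32850 span:", span_length(cg32850_start, cg32850_stop_end))
print("start-to-start distance:", abs(cg2177_start - cg32850_start))
```

The same three quantities computed on the melanogaster chunk would give the other column of the comparison.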
Given the significant synteny between melanogaster and grimshawi in this region and the placement of the Rad23 and Syt7 genes in the melanogaster genome, we might expect one or both of the Rad23 and Syt7 genes to appear in the grimshawi contig. To check for such a possibility, I ran TBLASTN searches for every exon of each of these genes against the entire (unmasked) grimshawi contig. Syt7 contains 11 exons. Rad23 contains 6 exons. Of these, none showed any alignment against the grimshawi contig with an e-value less than 0.6, most had e-values greater than 1, and none of the alignments were consistent with the hypothesis that either of these genes is represented in the grimshawi contig. It is possible that these genes were either translocated out of this section of the grimshawi genome or pushed onto a different contig by a large number of repeats inserting between them and CG

Figure 20: Repeat of the alignment shown in Figure 19 with the S. cerevisiae sequence removed.

Figure 21: CLUSTAL alignment for the CG32850-PA protein using the orthologs from mojavensis, virilis, pseudoobscura, melanogaster, and grimshawi.

Figure 22: Annotated melanogaster genes between bases 306,470 and 340,172 of the dot chromosome.
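The contrast in conservation between the two parts of the alignment in Figure 21 could be quantified by scoring each alignment column. The sketch below is an illustration only: it uses the fraction of sequences sharing the most common residue as a simple conservation score, which is not how CLUSTALW itself marks conservation, and the toy aligned sequences are invented rather than taken from the actual alignment.

```python
from collections import Counter


def column_conservation(aligned):
    """Fraction of sequences sharing the most common residue in each column.
    Gap characters ('-') never count towards the most common residue."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for column in zip(*aligned):
        residues = [r for r in column if r != "-"]
        top = Counter(residues).most_common(1)[0][1] if residues else 0
        scores.append(top / len(aligned))
    return scores


def mean_conservation(scores, start, end):
    """Average score over the half-open column range [start, end)."""
    window = scores[start:end]
    return sum(window) / len(window)


# Invented toy alignment: a variable first part and a conserved second part.
toy = ["MKT-LCH", "MRSALCH", "MQ-GLCH"]
scores = column_conservation(toy)
print(mean_conservation(scores, 0, 4), mean_conservation(scores, 4, 7))
```

Comparing the mean score of the first 65 columns with that of the remaining columns would put a number on the difference described for CG32850-PA.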

Figure 23: Summary table of RepeatMasker output from run on D. melanogaster

As an extra check, I ran the same searches using exons from genes CG42314 and Hcf, which are located just distal to CG32850 in the melanogaster genome. The results were the same as for Syt7 and Rad23.

Finally, to check for similarity in the number of repeats in the melanogaster and grimshawi chunks, I extracted the melanogaster region and ran RepeatMasker on it with the -species drosophila option. The results of this run are shown in Figure 23. Table 7 compares the amount and kinds of repeats (as well as the GC content) of the two chunks. The significantly increased number of repeats in the grimshawi contig compared to the melanogaster region may support the hypothesis that Syt7 and Rad23 are still present in the grimshawi genome, but were pushed onto another contig by elongation of repetitious regions and insertion of new repetitious sequence.

Table 7: Comparison of repeats in melanogaster and grimshawi. Columns: grimshawi, melanogaster, Difference. Rows: GC Content, Total Repeats, SINE, LINE, LTR, DNA, Unclassified, Satellites.

6 Conclusion

The final putative annotations for this contig are given in Figure 2 and Tables 3, 5, and 7. Two Genscan features were confirmed as orthologs to genes in melanogaster and two were rejected as unmasked repetitious sequence. Synteny of the two identified genes was confirmed with respect to the related region of the D. melanogaster dot chromosome, although there were significant variations in gene lengths, distances of gene separation, repeat content, and repeat type. Deviations from the melanogaster gene structure were generally minor, except for the missing ortholog to exon 5_941 in the CG2177 ortholog, but since exon 5_941 is the same as exon 4_942 with a few extra bases towards the beginning, it is easy to imagine how one species might be able to tolerate missing this exon.

Further work should be done on adjacent contigs to check for the Syt7 and Rad23 genes (which should be nearby if synteny is strongly preserved) as well as to see if the difference in repeat composition between melanogaster and grimshawi is similar throughout the genomes. Finally, while it is possible that there are still unannotated genes in this contig that were missed in this search, the repeat density on this contig and the absence of the other genes that were checked for in the study make that hypothesis unlikely.

7 Appendix

7.1 CG2177 Ortholog - B isoform

fasta file

>CG2177_transcript
ATGATGCTTGTCGATCAAGTGTCGCAACGCCAGACAACGGGTAGCGAAAACGATAAAAAT
ATTACGGCAACACTGGGTCTGGTCGTGCACGCGGCAGCCGATGGAGTCGCTTTGGGCGCT
GCTGCCACCACCAGTCACCAGGATGTGGAAATTATTGTTTTCCTTGCCATAATGTTGCAC
AAGGCGCCGGCCGCATTTGGTTTGGTCAGCTTTCTTCTGCACGAGAAAGTGGAGAGGCAA
CAGATACGCCGACATTTGGGCGTATTTTCGCTGTCGGCGCCATTGCTGACCCTGCTCACA
TATTTTGGCATTGGACAGGAGCAGAAGGAAACGTTGAATTCGGTGAACGCCACTGGGATT
GCCATGCTTTTTTCGGCGGGTACTTTTTTATATGTGGCAACGGTGCATGTGTTGCCCGAG
TTAACGCAGGCACATCAGCACAGTGGAATGCATCACAAGAATGGCACTGGTTCCGGTTCC
AGCACGTATGAGTATCATGCGCTGGAGGAATCACGCAGCGAAGCGGGGATTGACTCTGCT
GGGAGTGTTCAGGTTCACAGCAGCAGCAAACCAGGGCTGCTCTATGGTGAACTCATCATT
ATGATCTGTGGTGCTTTGCTGCCCCTGGTCATCACCTTTGGGCATCATCAT

gff file

track name="custommodel" description="custom Gene Model" color=200,0,0 visibility=2
contig12 GEP CDS gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP CDS gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP stop_codon gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP exon gene_id "CG2177"; transcript_id "CG2177";
contig12 GEP exon gene_id "CG2177"; transcript_id "CG2177";

pep file

>CG2177_peptide
MMLVDQVSQRQTTGSENDKNITATLGLVVHAAADGVALGAAATTSHQDVEIIVFLAIMLH
KAPAAFGLVSFLLHEKVERQQIRRHLGVFSLSAPLLTLLTYFGIGQEQKETLNSVNATGI
AMLFSAGTFLYVATVHVLPELTQAHQHSGMHHKNGTGSGSSTYEYHALEESRSEAGIDSA
GSVQVHSSSKPGLLYGELIIMICGALLPLVITFGHHH

7.2 CG2177 Ortholog - C Isoform

fasta file

>CG2177-RC_transcript
ATGGCCGAGGAGACTATAATACTAATATTGTTGGTAATTGTGATGCTGGTTGGCTCATAT
TTAGCTGGCAGTATACCGCTGGTCATGAAACTGAGCGAGGAGAAACTAAAATGTGTGACC
GTATTGGGTGCAGGTCTGCTGGTGGGCACAGCG TAACTGTCATTATACCCGAGGGCATA
AGATCTCTTTATATGGATAGCAGACGACAGCAGTTGCCACAAGCAGCGGATGCAAGCACA
ACGGGCATTTTGGTGGCGTCACCGCAAATGGACTATTCGAGAACAATTGGCTTGTCGCTT
GTATTGGGCTTTGTTT GATGCTTGTCGATCAAGTGTCGCAACGCCAGACAACGGGT
AGCGAAAACGATAAAAATATTACGGCAACACTGGGTCTGGTCGTGCACGCGGCAGCCGAT
GGAGTCGCTTTGGGCGCTGCTGCCACCACCAGTCACCAGGATGTGGAAATTATTGTTTTC
CTTGCCATAATGTTGCACAAGGCGCCGGCCGCATTTGGTTTGGTCAGCTTTCTTCTGCAC
GAGAAAGTGGAGAGGCAACAGATACGCCGACATTTGGGCGTATTTTCGCTGTCGGCGCCA
TTGCTGACCCTGCTCACATATTTTGGCATTGGACAGGAGCAGAAGGAAACGTTGAATTCG
GTGAACGCCACTGGGATTGCCATGCTTTTTTCGGCGGGTACTTTTTTATATGTGGCAACG
GTGCATGTGTTGCCCGAGTTAACGCAGGCACATCAGCACAGTGGAATGCATCACAAGAAT
GGCACTGGTTCCGGTTCCAGCACGTATGAGTATCATGCGCTGGAGGAATCACGCAGCGAA
GCGGGGATTGACTCTGCTGGGAGTGTTCAGGTTCACAGCAGCAGCAAACC GCTC
TATGGTGAACTCATCATTATGATCTGTGGTGCTTTGCTGCCCCTGGTCATCACCTTTGGG
CATCATCAT

gff file

track name="custommodel" description="custom Gene Model" color=200,0,0 visibility=2
contig12 GEP CDS gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP CDS gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP CDS gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP stop_codon gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP exon gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP exon gene_id "CG2177-RC"; transcript_id "CG2177-RC";
contig12 GEP exon gene_id "CG2177-RC"; transcript_id "CG2177-RC";

pep file

>CG2177-RC_peptide
MAEETIILILLVIVMLVGSYLAGSIPLVMKLSEEKLKCVTVLGAGLLVGTALTVIIPEGI
RSLYMDSRRQQLPQAADASTTGILVASPQMDYSRTIGLSLVLGFVFMMLVDQVSQRQTTG
SENDKNITATLGLVVHAAADGVALGAAATTSHQDVEIIVFLAIMLHKAPAAFGLVSFLLH
EKVERQQIRRHLGVFSLSAPLLTLLTYFGIGQEQKETLNSVNATGIAMLFSAGTFLYVAT
VHVLPELTQAHQHSGMHHKNGTGSGSSTYEYHALEESRSEAGIDSAGSVQVHSSSKPGLL
YGELIIMICGALLPLVITFGHHH

7.3 CG32850 Ortholog

fasta file

>CG32850-PA_transcript
ATGGGTAATTGCTTGAAAATGAGCAGTCCAGATGACATTTCACTTTTGCGAGGCAGCGAT
AGCATCATTAGTGCACAGGACAATGGACCAATGCCAATTTATCAGCAGGAGCCGATGCCA
CAGCTGTTCTATCAAACGGTCAGTGGCAATACATCTGGCAACGCTGTCGCCGCTGCCACT
CACATGTCCGAAGAGGATCAGATAAAAATAGCAAAGCGCATTGGATTAGTTCAACATTTG
CCGATTGGCACGTATGACAGCAACTCAAAGAAAGCAGCACGCGAATGCGTCATTTGTATG
GTGGAATTTAGCAACGAGGAAGCCGTTCGCTATTTGCCCTGCATGCACATTTATCATGTG
AACTGCATCGACGATTGGCTAATGCGTAGTTTAACCTGCCCCAGTTGCTTGGAACCGGTG
GATGCGGCTCTACTCACTAGCTATGAGACAACA

gff file

track name="custommodel" description="custom Gene Model" color=200,0,0 visibility=2
contig12 GEP CDS gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP CDS gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP stop_codon gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP exon gene_id "CG32850-PA"; transcript_id "CG32850-PA";
contig12 GEP exon gene_id "CG32850-PA"; transcript_id "CG32850-PA";

pep file

>CG32850-PA_peptide
MGNCLKMSSPDDISLLRGSDSIISAQDNGPMPIYQQEPMPQLFYQTVSGNTSGNAVAAAT
HMSEEDQIKIAKRIGLVQHLPIGTYDSNSKKAARECVICMVEFSNEEAVRYLPCMHIYHV
NCIDDWLMRSLTCPSCLEPVDAALLTSYETT
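The transcript and peptide records above can be cross-checked against each other by translating the coding sequence. The sketch below is a minimal, hand-rolled translator using the standard genetic code (a library such as Biopython would normally be used instead). The function name is hypothetical, and the example uses only the first 60 bases of the CG2177 B-isoform transcript and the first 20 residues of its peptide record.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in TCAG order, paired with residues.
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds):
    """Translate a coding sequence in frame 1; unknown codons become 'X'.
    Whitespace is removed first and a trailing partial codon is ignored."""
    cds = "".join(cds.split()).upper()
    usable = len(cds) - len(cds) % 3
    codons = (cds[i:i + 3] for i in range(0, usable, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


first_60_bases = "ATGATGCTTGTCGATCAAGTGTCGCAACGCCAGACAACGGGTAGCGAAAACGATAAAAAT"
first_20_residues = "MMLVDQVSQRQTTGSENDKN"
print(translate(first_60_bases) == first_20_residues)
```

Because the translator strips whitespace, any bases missing from a record will shift the reading frame and show up as a run of mismatches against the peptide from that point on.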


More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

2 Genome evolution: gene fusion versus gene fission

2 Genome evolution: gene fusion versus gene fission 2 Genome evolution: gene fusion versus gene fission Berend Snel, Peer Bork and Martijn A. Huynen Trends in Genetics 16 (2000) 9-11 13 Chapter 2 Introduction With the advent of complete genome sequencing,

More information

Frazer et al. ago (Aparicio et al. 2002), conserved long-range sequence organization has not been reported for more distantly related species. Figure

Frazer et al. ago (Aparicio et al. 2002), conserved long-range sequence organization has not been reported for more distantly related species. Figure Review Cross-Species Sequence Comparisons: A Review of Methods and Available Resources Kelly A. Frazer, 1,6 Laura Elnitski, 2,3 Deanna M. Church, 4 Inna Dubchak, 5 and Ross C. Hardison 3 1 Perlegen Sciences,

More information

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors

Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors Quantitative Measurement of Genome-wide Protein Domain Co-occurrence of Transcription Factors Arli Parikesit, Peter F. Stadler, Sonja J. Prohaska Bioinformatics Group Institute of Computer Science University

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

Computational Biology and Chemistry

Computational Biology and Chemistry Computational Biology and Chemistry 33 (2009) 245 252 Contents lists available at ScienceDirect Computational Biology and Chemistry journal homepage: www.elsevier.com/locate/compbiolchem Research Article

More information

Protein function prediction based on sequence analysis

Protein function prediction based on sequence analysis Performing sequence searches Post-Blast analysis, Using profiles and pattern-matching Protein function prediction based on sequence analysis Slides from a lecture on MOL204 - Applied Bioinformatics 18-Oct-2005

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

More information

BME 5742 Biosystems Modeling and Control

BME 5742 Biosystems Modeling and Control BME 5742 Biosystems Modeling and Control Lecture 24 Unregulated Gene Expression Model Dr. Zvi Roth (FAU) 1 The genetic material inside a cell, encoded in its DNA, governs the response of a cell to various

More information

15.2 Prokaryotic Transcription *

15.2 Prokaryotic Transcription * OpenStax-CNX module: m52697 1 15.2 Prokaryotic Transcription * Shannon McDermott Based on Prokaryotic Transcription by OpenStax This work is produced by OpenStax-CNX and licensed under the Creative Commons

More information

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова Предсказание и анализ промотерных последовательностей Татьяна Татаринова Eukaryotic Transcription 2 Initiation Promoter: the DNA sequence that initially binds the RNA polymerase The structure of promoter-polymerase

More information

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013 EBI web resources II: Ensembl and InterPro Yanbin Yin Spring 2013 1 Outline Intro to genome annotation Protein family/domain databases InterPro, Pfam, Superfamily etc. Genome browser Ensembl Hands on Practice

More information

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution

Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Bioinformatics: Investigating Molecular/Biochemical Evidence for Evolution Background How does an evolutionary biologist decide how closely related two different species are? The simplest way is to compare

More information

CHAPTER 3. Cell Structure and Genetic Control. Chapter 3 Outline

CHAPTER 3. Cell Structure and Genetic Control. Chapter 3 Outline CHAPTER 3 Cell Structure and Genetic Control Chapter 3 Outline Plasma Membrane Cytoplasm and Its Organelles Cell Nucleus and Gene Expression Protein Synthesis and Secretion DNA Synthesis and Cell Division

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Practical considerations of working with sequencing data

Practical considerations of working with sequencing data Practical considerations of working with sequencing data File Types Fastq ->aligner -> reference(genome) coordinates Coordinate files SAM/BAM most complete, contains all of the info in fastq and more!

More information

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: m Eukaryotic mrna processing Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus: Cap structure a modified guanine base is added to the 5 end. Poly-A tail

More information

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) Annotation of Plant Genomes using RNA-seq Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA) inuscu1-35bp 5 _ 0 _ 5 _ What is Annotation inuscu2-75bp luscu1-75bp 0 _ 5 _ Reconstruction

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Bioinformatics Exercises

Bioinformatics Exercises Bioinformatics Exercises AP Biology Teachers Workshop Susan Cates, Ph.D. Evolution of Species Phylogenetic Trees show the relatedness of organisms Common Ancestor (Root of the tree) 1 Rooted vs. Unrooted

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

Heuristic Methods. Heuristic methods for alignment Sequence databases Multiple alignment Gene and protein prediction

Heuristic Methods. Heuristic methods for alignment Sequence databases Multiple alignment Gene and protein prediction Heuristic methods for alignment Sequence databases Multiple alignment Gene and protein prediction Armstrong, 2010 Heuristic Methods! FASTA! BLAST! Gapped BLAST! PSI-BLAST Armstrong, 2010 1 Assumptions

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever.

Related Courses He who asks is a fool for five minutes, but he who does not ask remains a fool forever. CSE 527 Computational Biology http://www.cs.washington.edu/527 Lecture 1: Overview & Bio Review Autumn 2004 Larry Ruzzo Related Courses He who asks is a fool for five minutes, but he who does not ask remains

More information