GEP Annotation Report

Similar documents
Annotation of Drosophila grimashawi Contig12

BLAST. Varieties of BLAST

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Supplemental Materials

Using Bioinformatics to Study Evolutionary Relationships Instructions

Synteny Portal Documentation

Bioinformatics and BLAST

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

1. In most cases, genes code for and it is that

-max_target_seqs: maximum number of targets to report

Genome Annotation. Qi Sun Bioinformatics Facility Cornell University

Heuristic Alignment and Searching

RNA- seq read mapping

A Browser for Pig Genome Data

In Genomes, Two Types of Genes

RNA Processing: Eukaryotic mrnas

- conserved in Eukaryotes. - proteins in the cluster have identifiable conserved domains. - human gene should be included in the cluster.

EBI web resources II: Ensembl and InterPro

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Comparing whole genomes

Basic Local Alignment Search Tool

AP Bio Module 16: Bacterial Genetics and Operons, Student Learning Guide

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

MegAlign Pro Pairwise Alignment Tutorials

objective functions...

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Sequences, Structures, and Gene Regulatory Networks

Genome Browsers And Genome Databases. Andy Conley Computational Genomics 2009

Markov Models & DNA Sequence Evolution

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

Araport, a community portal for Arabidopsis. Data integration, sharing and reuse. sergio contrino University of Cambridge

Student Handout Fruit Fly Ethomics & Genomics

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

HMMs and biological sequence analysis

Procedure to Create NCBI KOGS

1 Abstract. 2 Introduction. 3 Requirements. 4 Procedure

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014

The Eukaryotic Genome and Its Expression. The Eukaryotic Genome and Its Expression. A. The Eukaryotic Genome. Lecture Series 11

Supplemental Information

Browsing Genomic Information with Ensembl Plants

Small RNA in rice genome

RGP finder: prediction of Genomic Islands

Open a Word document to record answers to any italicized questions. You will the final document to me at

Hands-On Nine The PAX6 Gene and Protein

Lecture 14: Multiple Sequence Alignment (Gene Finding, Conserved Elements) Scribe: John Ekins

BIOINFORMATICS LAB AP BIOLOGY

GENOME DUPLICATION AND GENE ANNOTATION: AN EXAMPLE FOR A REFERENCE PLANT SPECIES.

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Bioinformatics Exercises

Investigation 3: Comparing DNA Sequences to Understand Evolutionary Relationships with BLAST

EBI web resources II: Ensembl and InterPro. Yanbin Yin Spring 2013

CHAPTER 3. Cell Structure and Genetic Control. Chapter 3 Outline

Visit to BPRC. Data is crucial! Case study: Evolution of AIRE protein 6/7/13

Phylogenetics - Orthology, phylogenetic experimental design and phylogeny reconstruction. Lesser Tenrec (Echinops telfairi)

RNA & PROTEIN SYNTHESIS. Making Proteins Using Directions From DNA

RELATIONSHIPS BETWEEN GENES/PROTEINS HOMOLOGUES

O 3 O 4 O 5. q 3. q 4. Transition

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Comparative genomics: Overview & Tools + MUMmer algorithm

Experiment 1 Scientific Writing Tools

Computational Identification of Evolutionarily Conserved Exons

Detecting non-coding RNA in Genomic Sequences

Correspondence of D. melanogaster and C. elegans developmental stages revealed by alternative splicing characteristics of conserved exons

Introduction to Bioinformatics

Collisions and conservation laws

Isoform discovery and quantification from RNA-Seq data

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

MiGA: The Microbial Genome Atlas

TRANSCRIPTION VS TRANSLATION FILE

Comparative analysis of RNA- Seq data with DESeq2

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

Mitochondrial Genome Annotation

Bioinformatics for Biologists

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics CGS 5166: Bioinformatics Tools. Evaluation. Course Homepage.

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

Reading Assignments. A. Genes and the Synthesis of Polypeptides. Lecture Series 7 From DNA to Protein: Genotype to Phenotype

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова

Genome-wide analysis of the MYB transcription factor superfamily in soybean

EECS730: Introduction to Bioinformatics

Statistical Inferences for Isoform Expression in RNA-Seq

BIOINFORMATICS. PILER: identification and classification of genomic repeats. Robert C. Edgar 1* and Eugene W. Myers 2 1 INTRODUCTION

Name: SAMPLE EOC PROBLEMS

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CSCE555 Bioinformatics. Protein Function Annotation

EBI web resources II: Ensembl and InterPro

Bayesian Clustering of Multi-Omics

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

The Developmental Transcriptome of the Mosquito Aedes aegypti, an invasive species and major arbovirus vector.

Session 5: Phylogenomics

Section 7. Junaid Malek, M.D.

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Abstract. CROMER, JOHN PATRICK. Genome Annotation of Meloidogyne hapla. (Under the direction of David Bird and Charles Opperman).

Transcription:

GEP Annotation Report Note: For each gene described in this annotation report, you should also prepare the corresponding GFF, transcript and peptide sequence files as part of your submission. Student name: Wilson Leung Student email: student@example.edu Faculty advisor: Sarah C.R. Elgin College/university: Washington University in St. Louis Project details Project name: contig10 Project species: D. biarmipes Date of submission: 12/30/2018 Size of project in base pairs: 43,013 Number of genes in project: 3 Does this report cover all of the genes or is it a partial report? Partial report If this is a partial report, please indicate the region of the project covered by this report: From base 25,000 to base 28,000 Instructions for project with no genes If you believe that the project does not contain any genes, please provide the following evidence to support your conclusion: 1. Perform a NCBI BLASTX search of the entire contig sequence against the nonredundant protein sequences (nr) database. Provide an explanation for any significant (E-value < 1e-5) hits to known genes in the nr database as to why they do not correspond to real genes in the project. 2. For each Genscan prediction, perform a NCBI BLASTP search of the predicted amino acid sequence against the nr protein database using the strategy described above. 3. Examine the gene expression tracks (e.g., RNA-Seq) for evidence of transcribed regions that do not correspond to alignments to known D. melanogaster proteins. Perform a NCBI BLASTX search against the nr protein database using these genomic regions to determine if they show sequence similarity to known or predicted proteins in the nr database. 1

Complete the following Gene Report Form for each gene in your project. Copy and paste the sections below to create as many copies as needed within this report. Be sure to create enough Isoform Report Forms within your Gene Report Form for all isoforms. Gene report form Gene name (e.g., D. biarmipes eyeless): D. biarmipes CG31997 Gene symbol (e.g., dbia_ey): dbia_cg31997 Approximate location in project (from 5 end to 3 end): 25673-27471 Number of isoforms in D. melanogaster: 2 Number of isoforms in this project: 2 Complete the following table for all the isoforms in this project: Name(s) of unique isoform(s) based on coding sequence CG31997-PB List of isoforms with identical coding sequences CG31997-PA Names of the isoforms with unique coding sequences in D. melanogaster that are absent in this species: NA Note: For isoforms with identical coding sequence, you only need to complete the Isoform Report Form for one of these isoforms (i.e. using the name of the isoform listed in the left column of the table above). However, you should generate GFF, transcript, and peptide sequence files for ALL isoforms, irrespective of whether they have identical coding sequences as other isoforms. Consensus sequence errors report form Complete this section if you have identified errors in the project consensus sequence that affect the annotation of the gene described above. All of the coordinates reported in this section should be relative to the coordinates of the original project sequence. Location(s) in the project sequence with consensus errors: NA 2

Isoform report form Complete this report form for each unique isoform listed in the table above. Copy and paste this form to create as many copies of this Isoform Report Form as needed. Gene-isoform name (e.g., dbia_ey-pa): dbia_cg31997-pb Names of the isoforms with identical coding sequences as this isoform: dbia_cg31997-pa Is the 5 end of this isoform missing from the end of the project? No If so, how many exons are missing from the 5 end: Is the 3 end of this isoform missing from the end of the project? No If so, how many exons are missing from the 3 end: 1. Gene Model Checker checklist Enter the coordinates of your final gene model for this isoform into the Gene Model Checker and paste a screenshot of the checklist results into the box below: Note: For projects with consensus sequence errors, report the exon coordinates relative to the original project sequence. Include the VCF file you have generated above when you submit the gene model to the Gene Model Checker. The Gene Model Checker will use this VCF file to automatically revise the submitted exon coordinates. 3

2. View the gene model on the Genome Browser Use the custom track feature from the Gene Model Checker to capture a screenshot of your gene model shown on the Genome Browser for your project. Zoom in so that only this isoform is in the screenshot. (See page 12 of the Gene Model Checker user guide on how to do this; you can find the guide under Help Documentations Web Framework on the GEP website at http://gep.wustl.edu.) Include the following evidence tracks in the screenshot if they are available: 1. A sequence alignment track (D. mel Proteins or Other RefSeq) 2. At least one gene prediction track (e.g., Genscan) 3. At least one RNA-Seq track (e.g., RNA-Seq Alignment Summary) 4. A comparative genomics track (e.g., Conservation, D. mel. Net Alignment) Paste a screenshot of your gene model as shown on the GEP UCSC Genome Browser into the box below: Low-frequency RNA-Seq exon junctions not annotated: The evidence from the RNA-Seq TopHat tracks and Multiz alignments suggest that there might be additional isoforms because of alternative splicing at the 5' end of this gene (red arrows in the screenshot above). However, because most of the TopHat junctions are supported by less than 10 reads, there is insufficient evidence to postulate the presence of multiple novel isoforms in D. biarmipes compared to D. melanogaster. 4

Extra CDS predicted by the SNAP gene predictor: SNAP predicted a CDS at 26,502-26,584 (blue arrow in the screenshot above) between the first and second CDS's of CG31997. The RNA-Seq Alignment Summary track shows that the region surrounding this region has low (<20 reads) RNA-Seq read coverage and the region is adjacent to a hat DNA transposon fragment (see screenshot below). NCBI BLASTX search of the genomic region surrounding the SNAP CDS prediction (contig10:26400-26700) against the nr database did not detect any significant (E-value < 1e-5) sequence similarity to known proteins in the nr database (see screenshot below). 5

A NCBI BLASTN search of this region against the nt database detected five significant matches to predicted mrnas in Drosophila suzukii (see screenshot below). The E-values for these D. suzukii matches range from 3e-10 to 2e-06, and they correspond to three different predicted genes (LOC108013970, LOC108011950, and LOC108014610). All of these matches are RefSeq predictions that have not been confirmed experimentally. There are no significant matches to RefSeq records that are supported by experimental evidence and no significant matches to mrnas in other species besides D. suzukii. Collectively, while we could not reject the possibility that this region of contig10 contains an untranslated region of a nearby gene, there is insufficient evidence to postulate a novel isoform of CG31997 in D. biarmipes compared to D. melanogaster. Given the proximity of this feature to the hat DNA transposon and the multiple matches to predicted transcripts in D. suzukii, an alternative explanation is that the feature is part of a transposon that is found in both D. biarmipes and D. suzukii. Hence we have omitted this predicted CDS in our annotation of the CG31997 ortholog in D. biarmipes. 6

3. Alignment between the submitted model and the D. melanogaster ortholog Show an alignment between the protein sequence for your gene model and the protein sequence from the putative D. melanogaster ortholog. You can either use the protein alignment generated by the Gene Model Checker (available through the View protein alignment link under the Dot Plot tab) or you can generate a new alignment using the Align two or more sequences feature (bl2seq) at the NCBI BLAST web site. Paste a screenshot of the protein alignment into the box below: 4. Dot plot between the submitted model and the D. melanogaster ortholog Paste a screenshot of the dot plot of your submitted model against the putative D. melanogaster ortholog (generated by the Gene Model Checker) into the box below. Provide an explanation for any anomalies on the dot plot (e.g., large gaps, regions with no sequence similarity). 7

Note: Large vertical and horizontal gaps near exon boundaries in the dot plot often indicate that an incorrect splice site might have been picked. Please re-examine these regions and provide a justification as to why you have selected this particular set of donor and acceptor sites. The dot plot shows that the last two CDS's of CG31997-PB are highly conserved between the proposed D. biarmipes gene model and the D. melanogaster ortholog. Examination of the protein alignment at the end of the second and third CDS's indicate that the amino acids have similar chemical properties even though they are not identical. In addition, the lengths of these two CDS's are the same between D. biarmipes and D. melanogaster. The dot plot shows that the beginning of the first CDS of CG31997-PB is only weakly conserved between D. biarmipes and D. melanogaster. In addition, the dot plot shows that the first CDS of the D. biarmipes gene model is longer than the orthologous CDS in D. melanogaster. The protein alignment shows that there are 8 additional amino acids within the first CDS in the proposed D. biarmipes gene model compared to D. melanogaster. Examination of this region in the GEP UCSC Genome Browser shows that there is only one methionine in frame +2 that could serve as the start codon for CG31997-PB (see screenshot below). The expansion of this CDS is consistent with the BLASTX alignment, the N-SCAN gene prediction, and the available RNA-Seq data. Consequently, our annotation has expanded the size of this CDS (1_10777_0) in order to retain this isoform in D. biarmipes. 8