Ensembl Exercise Answers Adapted from Ensembl tutorials presented by Dr. Bert Overduin, EBI

Exercise 1 - Exploring the human MYH9 gene

(a) Go to the Ensembl homepage (http://www.ensembl.org). Select Search: Human and type MYH9 gene. Click [Go]. Click on Homo sapiens on the page with search results. Click on Gene. Click on Ensembl protein_coding Gene: ENSG00000100345 (HGNC Symbol: MYH9).

The gene is located on chromosome 22 on the reverse strand. Ensembl has 11 transcripts annotated for this gene, three of which are protein coding. The longest transcript is MYH9-001, which codes for a protein of 1960 amino acids. MYH9-001 has a CCDS record. CCDS is the consensus coding sequence set: a collection of reviewed coding sequences (CDS) for human and mouse that have been agreed upon by Ensembl, NCBI, UCSC and Havana. These sequences are high-confidence and unlikely to change in the future.

(b) These are some of the phenotypes associated with MYH9 according to MIM: autosomal dominant deafness, Epstein syndrome, and Fechtner syndrome. Click on any of these for more information in the MIM record itself.

(c) Click on ENST00000216181. It has 41 exons. This is shown in the Transcript summary. Click on Exons in the side menu. Exon 1 is completely untranslated, and exons 2 and 41 are partially untranslated (UTR sequence is shown in purple). You can also see this in the cDNA view.

Click on General identifiers in the side menu. MYH9_HUMAN from Swiss-Prot matches the Ensembl transcript. Click on it to go to UniProtKB, or click align for the alignment between the Ensembl translation and the Swiss-Prot record.

Have a look at the Ontology table. The Gene Ontology project (http://www.geneontology.org/) maps terms to a protein in three classes: biological process, cellular component, and molecular function. Meiotic spindle organization, cell morphogenesis, and cytokinesis are some of the roles associated with MYH9-001.

(d) Click on Oligo probes in the side menu.
Probesets from Affymetrix, Agilent, Codelink, Illumina, and Phalanx match this transcript sequence. Expression analysis with any of these probesets would reveal information about the transcript. Hint: this information can sometimes be found in the ArrayExpress Atlas: www.ebi.ac.uk/arrayexpress/

Exercise 2 - Exploring a genomic region in human

(a) Go to the Ensembl homepage (http://www.ensembl.org/). Select Search: Human and type 13:32448000-33198000 in the "for" text box (or alternatively leave the Search drop-down list as it is and type human 13:32448000-33198000 in the "for" text box). Click [Go].

This genomic region is located on cytogenetic band q13.1. It is made up of seven contigs, indicated by the alternating light and dark blue coloured bars in the Contigs track.

(b) With your mouse, draw a box encompassing the BRCA2 transcripts. Click on Jump to region in the pop-up menu.

(c) Click [Configure this page] in the side menu. Type clones in the Find a track text box. Select 1Mb clone set, 32k clone set and Tilepath. Click ( ).

No single clone contains the complete BRCA2 gene. For example, clone RP11-37E23 contains most of the gene, but not its very 3' end. This is reflected in the two contigs needed to make up the BRCA2 gene (the Contigs track is on by default).

(d) Click [Configure this page] in the side menu. Type refseq in the Find a track text box. Select Human RefSeq import Expanded with labels. Click ( ). Click on individual transcript models (RefSeq or otherwise) to retrieve more information about them.

RefSeq has annotated one transcript for the BRCA2 gene, i.e. NM_000059.3. This transcript is almost identical to Ensembl transcript BRCA2-001 (ENST00000380152). Both encode a 3418 aa protein, but the RefSeq transcript is shorter at the 5' UTR and longer at the 3' UTR.

(e) Click [Export data] in the side menu. Click [Next>]. Click on Text.

Note that the sequence has a header that provides information about the genome assembly (GRCh37), the chromosome number, the start and end coordinates and the strand. For example:

>13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1

(f) Click [Configure this page] in the side menu. Click [Reset configuration]. Click ( ).

Exercise 3 - Exploring a sequence variant (human)

(a) Go to the Ensembl homepage (http://www.ensembl.org/). Select Search: Human and type f5 in the "for" text box. Click [Go]. Click on Variation table under F5 (Human Gene). Click on Show for Missense variant in the Summary of variation consequences in ENSG00000198734 table. Type 534 in the Filter text box.

The dbSNP accession number for the Arg534Gln (Q/R) variant is rs6025.

Note that HGVS (Human Genome Variation Society) notations are not shown in the table by default. They can be added as follows: Click on Configure this page in the side menu. Click on Consequence options. Check Show HGVS notations. Click ( ).

(b) rs6025 is supported by all six possible types of evidence (represented by icons), i.e. Multiple observations (the variant has multiple independent dbSNP submissions, i.e. submissions with different submitter handles or different discovery samples), Frequency (the variant is reported to be polymorphic in at least one sample), HapMap (the variant is polymorphic in at least one HapMap panel), 1000 Genomes (the variant was discovered in the 1000 Genomes Project), Cited (dbSNP holds a citation from PubMed for the variant) and ESP (the variant was discovered in the Exome Sequencing Project).

(c) Click on rs6025.

No, rs6025 is a missense variant in two F5 transcripts. It is a 3 prime UTR variant in one F5 transcript, i.e. ENST00000546081. Note that in total four transcripts have been annotated for the F5 gene: http://www.ensembl.org/homo_sapiens/gene/summary?db=core;g=ensg00000198734.

(d) In Ensembl the alleles of rs6025 are given as T and C, because these are the alleles on the forward strand of the genome. In dbSNP the alleles are given as A and G, because the person(s) who submitted this variant apparently sequenced the reverse strand of the genome. In the literature the alleles are mostly given as A and G, because the F5 gene is located on the reverse strand of the genome, so the alleles in the actual gene and transcript sequences are A and G.

(e) Ensembl puts the allele that is present in the GRCh37 reference genome first, i.e. T (forward strand). In the case of rs6025 this is the minor allele. The reference genome can contain the minor allele for a variant because it is an amalgamation of the genomes of just a few individuals, not a representation of what is most common in the human population as a whole. In the literature the major allele (in the population of interest) is normally put first.
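The strand relationship described in (d) can be illustrated with a minimal Python sketch: complementing the forward-strand alleles gives the alleles as they read on the reverse strand, where the F5 gene lies. The function name is hypothetical and chosen only for illustration; it is not part of any Ensembl tool.

```python
# Base-pairing rules for DNA.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_strand_alleles(forward_alleles):
    """Return each allele as it reads on the reverse strand.

    Each allele is reverse-complemented. For single-nucleotide alleles
    such as those of rs6025 this is simply the complementary base.
    """
    return [
        "".join(COMPLEMENT[base] for base in reversed(allele))
        for allele in forward_alleles
    ]


# Forward-strand alleles of rs6025 as listed in Ensembl (reference allele first).
print(reverse_strand_alleles(["T", "C"]))  # ['A', 'G']
```

The reference allele T on the forward strand corresponds to A in the gene and transcript sequences, and C corresponds to G, which matches the A and G notation used in dbSNP and the literature.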

(f) rs6025 is predicted to be tolerated and benign according to SIFT and PolyPhen, because they predict the effect of the change from reference allele to alternate allele, i.e. from T (minor allele) to C (major allele).

(g) Click on Population genetics in the side menu.

Yes, there is ethnic variation in the frequency of the T allele. Among the 1000 Genomes populations studied, it ranges from 0 in the various African and East Asian populations to 0.029 in the CEU (Utah Residents (CEPH) with Northern and Western European ancestry) population.

(h) Click on Phenotype Data in the side menu.

rs6025 has been associated with a number of different phenotypes, i.e. venous thromboembolism, susceptibility to Budd-Chiari syndrome, recurrent abortion, thrombophilia due to activated protein C resistance, thrombophilia due to factor V Leiden and susceptibility to ischemic stroke.

(i) Click on Phylogenetic Context in the side menu.

Gorilla, orangutan, macaque and marmoset all have a C in this position, which confirms that C is indeed the ancestral allele.

(j) Go to the Neandertal Genome Browser (http://neandertal.ensemblgenomes.org/). Type rs6025 in the Search Neandertal text box. Click [Go]. Click on rs6025. Click on Jump to region in detail. Click on Configure this page in the side menu. Click on Variation features. Select All variations Normal. Click [SAVE and close]. Draw a box of about 50 bp around rs6025 (shown in yellow in the center of the display). Click on Jump to region in the pop-up menu.

The Sequences track shows that there are five reads for Neandertal at the position of rs6025, four with a C and one with a T. However, the T is at the very end of a sequence read and may therefore be of questionable quality. So, all in all, there is not enough evidence that the T allele was already present in Neandertal.

Exercise 4 - Orthologues, paralogues and gene trees (human)

(a) Go to the Ensembl homepage (http://www.ensembl.org/).
Select Search: Human and type long wave sensitive opsin in the "for" text box. Click [Go]. Click on OPN1LW (Human Gene).

Note that LW in the gene symbol OPN1LW stands for long-wave.

(b) Click on Comparative Genomics - Paralogues in the side menu.

Nine within-species paralogues have been identified for the human OPN1LW gene. According to the Target and Query %id, the proteins encoded by the genes ENSG00000166160 (OPN1MW2) and ENSG00000147380 (OPN1MW), i.e. the medium-wave-sensitive (green) opsins, show the highest sequence similarity to red opsin. (Target %id indicates the percentage of the sequence of red opsin matching the sequence of the paralogue protein. Query %id indicates the percentage of the sequence of the paralogue protein matching the sequence of red opsin.)

(c) Click on the Location: X:153,409,698-153,424,507 tab.

The OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes are located next to each other on the X chromosome, while the OPN1SW (blue opsin) gene is located on chromosome 7. As females have two X chromosomes, a normal gene on one chromosome can often make up for a defective one on the other, whereas males, with a single X chromosome, cannot make up for a defective gene. Thus, red-green colour blindness is much more prevalent in males than in females. Variation in the genes for red and green opsin can cause subtle differences in colour perception, while tandem rearrangements due to unequal crossing-over between these genes cause more serious defects in colour vision.

(d) Click on the Gene: OPN1LW tab. Click on Comparative Genomics - Gene tree (image) in the side menu. Click on View options: View paralogs of current gene below the gene tree image. Click on the nodes (red squares) for the duplication events that have given rise to the various paralogues.

A duplication event at the level of the Catarrhini (Apes and Old World monkeys) has given rise to the OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes. The other paralogues are due to earlier duplication events. This agrees with the fact that the green opsins show the highest sequence similarity with red opsin (see question b) and the fact that the genes for the red and green opsins are located close to each other on the genome (see question c).

Note: On the Paralogues page nine paralogues are shown (see question b). Five of these are of the type other paralogue. These are paralogues that are too distant to be in the same gene tree, but can still be related as part of a broader super-family. Therefore, the gene tree for the OPN1LW gene shows only four of its nine paralogues. The precise taxonomic level of duplication for the other paralogues is left undetermined.

(e) Click on the speciation node (blue square) that is at the base of the complete gene tree. Click on Expand for Jalview in the pop-up menu (that should say Taxon: Chordates). Click [Start Jalview]. Close the pop-up window with the gene tree. Click on Select > Select all on the menu bar of the pop-up window with the protein sequence alignment. Click on Calculate > Sort > by ID on the menu bar. Select the protein sequences of the human paralogues. Click on Select > Invert Sequence Selection on the menu bar. Click on Edit > Delete on the menu bar.

As the alignment is based on the complete set of protein sequences in the gene tree, the alignment of this subset of five proteins will contain empty columns. Remove them by clicking on Edit > Remove Empty Columns on the menu bar.

Exercise 5 - BioMart

Go to the Ensembl homepage (http://www.ensembl.org/). Click on the BioMart link on the toolbar.

Start with all human Ensembl genes: Choose the Ensembl Genes 73 database. Choose the Homo sapiens genes (GRCh37.p12) dataset.

Now, filter for the genes on the Y chromosome: Click on Filters in the left panel. Expand the REGION section by clicking on the + box. Select Chromosome Y.
Make sure the check box in front of the filter is ticked, otherwise the filter won't work. Click the [Count] button on the toolbar. This should give you 506 / 63605 Genes.

Now filter further for genes that are protein-coding: Expand the GENE section by clicking on the + box. Select Gene type protein_coding. Click the [Count] button on the toolbar. This should give you 54 / 63605 Genes.

Finally, filter for genes that encode proteins containing one or more transmembrane domains: Expand the PROTEIN DOMAINS section by clicking on the + box. Select Transmembrane domains Only. Click the [Count] button on the toolbar. This should give you 4 / 63605 Genes.

Specify the attributes to be included in the output (note that a number of attributes will already be selected by default): Click on Attributes in the left panel. Expand the GENE section by clicking on the + box. Select, in addition to the attributes Ensembl Gene ID and Ensembl Transcript ID that are already selected, for instance Associated Gene Name and Description.

Have a look at a preview of the results (only 10 rows of the results will be shown): Click the [Results] button on the toolbar.

If you are happy with how the results look in the preview, output all the results: Select View All rows as HTML or export all results to a file.

Note: When you select View All rows as HTML, your results will be shown in a new tab or a new window in your Internet browser.

Although you have filtered for only four genes, your results will contain more than four rows. This is because several of the genes have more than one transcript that encodes a protein containing one or more transmembrane domains, so the results contain a separate row for each of these transcripts.
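The difference between rows and genes can be illustrated with a minimal Python sketch that groups result rows by gene. The rows below are hypothetical placeholders, not real BioMart output, and the function name is chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical (gene ID, transcript ID) rows, standing in for an exported
# BioMart result table. These identifiers are placeholders, not real data.
rows = [
    ("GENE_A", "TRANSCRIPT_A1"),
    ("GENE_A", "TRANSCRIPT_A2"),
    ("GENE_B", "TRANSCRIPT_B1"),
    ("GENE_C", "TRANSCRIPT_C1"),
    ("GENE_C", "TRANSCRIPT_C2"),
    ("GENE_D", "TRANSCRIPT_D1"),
]


def transcripts_per_gene(result_rows):
    """Group transcript IDs by gene ID, one list of transcripts per gene."""
    grouped = defaultdict(list)
    for gene_id, transcript_id in result_rows:
        grouped[gene_id].append(transcript_id)
    return dict(grouped)


grouped = transcripts_per_gene(rows)
print(f"{len(rows)} rows, {len(grouped)} genes")
for gene_id, transcript_ids in grouped.items():
    print(gene_id, len(transcript_ids))
```

In this made-up table, six rows collapse to four genes, which is the same pattern as in the exercise: the gene count reported by [Count] is smaller than the number of rows because each row corresponds to one transcript.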