Ensembl Exercise Answers

Adapted from Ensembl tutorials presented by Dr. Bert Overduin, EBI

Exercise 1 Exploring the human MYH9 gene

(a) Go to the Ensembl homepage (http://www.ensembl.org). Select Search: Human and type MYH9 gene. Click [Go]. Click on Homo sapiens on the page with search results. Click on Gene. Click on Ensembl protein_coding Gene: ENSG00000100345 (HGNC Symbol: MYH9).

The gene is located on chromosome 22 on the reverse strand. Ensembl has 11 transcripts annotated for this gene. Three transcripts are protein coding. The longest transcript is MYH9-001, and it codes for a protein of 1960 amino acids. MYH9-001 has a CCDS record. CCDS is the consensus coding sequence set: a collection of reviewed coding sequences (CDS) for human and mouse that have been agreed upon by Ensembl, NCBI, UCSC and Havana. These sequences are high-confidence and unlikely to change in the future.

(b) These are some of the phenotypes associated with MYH9 according to MIM: autosomal dominant deafness, Epstein syndrome, and Fechtner syndrome. Click on any of these for more information in the MIM record itself.

(c) Click on ENST00000216181. It has 41 exons. This is shown in the Transcript summary. Click on Exons in the side menu. Exon 1 is completely untranslated, and exons 2 and 41 are partially untranslated (UTR sequence is shown in purple). You can also see this in the cDNA view.

Click on General identifiers in the side menu. MYH9_HUMAN from Swiss-Prot matches the Ensembl transcript. Click on it to go to UniProtKB, or click align for the alignment between the Ensembl translation and the Swiss-Prot record.

Have a look at the Ontology table. The Gene Ontology project (http://www.geneontology.org/) maps terms to a protein in three classes: biological process, cellular component, and molecular function. Meiotic spindle organization, cell morphogenesis, and cytokinesis are some of the roles associated with MYH9-001.

(d) Click on Oligo probes in the side menu.
Probesets from Affymetrix, Agilent, Codelink, Illumina, and Phalanx match this transcript sequence. Expression analysis with any of these probesets would reveal information about the transcript. Hint: this information can sometimes be found in the ArrayExpress Atlas: www.ebi.ac.uk/arrayexpress/

Exercise 2 Exploring a genomic region in human

(a) Go to the Ensembl homepage (http://www.ensembl.org/). Select Search: Human and type 13:32448000-33198000 in the for text box (or alternatively leave the Search drop-down list as it is and type human 13:32448000-33198000 in the for text box). Click [Go].

This genomic region is located on cytogenetic band q13.1. It is made up of seven contigs, indicated by the alternating light and dark blue coloured bars in the Contigs track.

(b) Draw a box with your mouse encompassing the BRCA2 transcripts. Click on Jump to region in the pop-up menu.

(c) Click [Configure this page] in the side menu. Type clones in the Find a track text box. Select 1Mb clone set, 32k clone set and Tilepath. Close the configuration panel to save the changes.

No single clone contains the complete BRCA2 gene. For example, clone RP11-37E23 contains most of the gene, but not its very 3' end. This is also reflected in the two contigs needed to make up the BRCA2 gene (the Contigs
track is on by default).

(d) Click [Configure this page] in the side menu. Type refseq in the Find a track text box. Select Human RefSeq import Expanded with labels. Close the configuration panel to save the changes. Click on individual transcript models (RefSeq or otherwise) to retrieve more information about them.

One transcript has been annotated by RefSeq for the BRCA2 gene, i.e. NM_000059.3. This transcript is almost identical to Ensembl transcript BRCA2-001 (ENST00000380152). Both encode a 3418 aa protein, but the RefSeq transcript is shorter at the 5' UTR and longer at the 3' UTR.

(e) Click [Export data] in the side menu. Click [Next>]. Click on Text.

Note that the sequence has a header that provides information about the genome assembly (GRCh37), the chromosome number, the start and end coordinates and the strand. For example:

>13 dna:chromosome chromosome:GRCh37:13:32883613:32978196:1

(f) Click [Configure this page] in the side menu. Click [Reset configuration]. Close the configuration panel to save the changes.

Exercise 3 Exploring a sequence variant (human)

(a) Go to the Ensembl homepage (http://www.ensembl.org/). Select Search: Human and type F5 in the for text box. Click [Go]. Click on Variation table under F5 (Human Gene). Click on Show for Missense variant in the Summary of variation consequences in ENSG00000198734 table. Type 534 in the Filter text box.

The dbSNP accession number for the Arg534Gln (Q/R) variant is rs6025.

Note that HGVS (Human Genome Variation Society) notations are not shown in the table by default. They can be added as follows: Click on Configure this page in the side menu. Click on Consequence options. Check Show HGVS notations. Close the configuration panel to save the changes.

(b) rs6025 is supported by all six possible types of evidence (represented by icons), i.e. Multiple observations (the variant has multiple independent dbSNP submissions, i.e.
submissions with different submitter handles or different discovery samples), Frequency (the variant is reported to be polymorphic in at least one sample), HapMap (the variant is polymorphic in at least one HapMap panel), 1000 Genomes (the variant was discovered in the 1000 Genomes Project), Cited (dbSNP holds a citation from PubMed for the variant) and ESP (the variant was discovered in the Exome Sequencing Project).

(c) Click on rs6025.

No. rs6025 is a missense variant for two F5 transcripts. It is a 3 prime UTR variant for one F5 transcript, i.e. ENST00000546081. Note that in total four transcripts have been annotated for the F5 gene: http://www.ensembl.org/homo_sapiens/gene/summary?db=core;g=ensg00000198734.

(d) In Ensembl the alleles of rs6025 are given as T and C, because these are the alleles on the forward strand of the genome. In dbSNP the alleles are given as A and G, because the person(s) who submitted this variant apparently sequenced the reverse strand of the genome. In the literature the alleles are mostly given as A and G, because the F5 gene is located on the reverse strand of the genome; thus the alleles in the actual gene and transcript sequences are A and G.

(e) Ensembl puts the allele that is present in the GRCh37 reference genome first, i.e. T (forward strand). In the case of rs6025 this is the minor allele. The reference genome can contain the minor allele for a variant because it is an amalgamation of the genomes of just a few individuals, not a representation of what is most common in the human population as a whole. In the literature the major allele (in the population of interest) is normally put first.
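The strand relationship described in (d) can be illustrated with a minimal Python sketch. The function name reverse_complement and the variable names are illustrative assumptions, not part of Ensembl or dbSNP; the only data used are the rs6025 alleles quoted above.

```python
# Watson-Crick complement for each DNA base.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(allele):
    """Return the allele as it reads on the opposite strand."""
    return "".join(COMPLEMENT[base] for base in reversed(allele))


# rs6025 alleles as given by Ensembl (forward strand of the genome).
forward_alleles = ["T", "C"]

# The same alleles as seen on the reverse strand, where F5 is located.
reverse_alleles = [reverse_complement(a) for a in forward_alleles]

print(reverse_alleles)  # ['A', 'G']
```

For single-base alleles, reversing the sequence changes nothing, so only the complement matters here; the reversal step is included so the same helper also behaves sensibly for longer alleles.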

(f) rs6025 is predicted to be tolerated by SIFT and benign by PolyPhen, because both tools predict the effect of the change from reference allele to alternate allele, i.e. from T (minor allele) to C (major allele).

(g) Click on Population genetics in the side menu.

Yes, there is ethnic variation in the frequency of the T allele. Among the 1000 Genomes populations studied, it ranges from 0 in the various African and East Asian populations to 0.029 in the CEU (Utah Residents (CEPH) with Northern and Western European ancestry) population.

(h) Click on Phenotype Data in the side menu.

rs6025 has been associated with a number of different phenotypes, i.e. venous thromboembolism, susceptibility to Budd-Chiari syndrome, recurrent abortion, thrombophilia due to activated protein C resistance, thrombophilia due to factor V Leiden and susceptibility to ischemic stroke.

(i) Click on Phylogenetic Context in the side menu.

Gorilla, orangutan, macaque and marmoset all have a C in this position, which confirms that C is indeed the ancestral allele.

(j) Go to the Neandertal Genome Browser (http://neandertal.ensemblgenomes.org/). Type rs6025 in the Search Neandertal text box. Click [Go]. Click on rs6025. Click on Jump to region in detail. Click on Configure this page in the side menu. Click on Variation features. Select All variations Normal. Click [SAVE and close]. Draw a box of about 50 bp around rs6025 (shown in yellow in the center of the display). Click on Jump to region in the pop-up menu.

The Sequences track shows that there are five reads for Neandertal at the position of rs6025, four with a C and one with a T. However, the T is at the very end of a sequence read and may therefore be of questionable quality. So, all in all, there is not enough evidence that the T allele was already present in Neandertal.

Exercise 4 Orthologues, paralogues and gene trees (human)

(a) Go to the Ensembl homepage (http://www.ensembl.org/).
Select Search: Human and type long wave sensitive opsin in the for text box. Click [Go]. Click on OPN1LW (Human Gene).

Note that LW in the gene symbol OPN1LW stands for long-wave.

(b) Click on Comparative Genomics - Paralogues in the side menu.

Nine within-species paralogues have been identified for the human OPN1LW gene. According to the Target and Query %id, the proteins encoded by the genes ENSG00000166160 (OPN1MW2) and ENSG00000147380 (OPN1MW), i.e. the medium-wave-sensitive (green) opsins, show the highest sequence similarity to red opsin. Target %id indicates the percentage of the sequence of red opsin matching the sequence of the paralogue protein. Query %id indicates the percentage of the sequence of the paralogue protein matching the sequence of red opsin.

(c) Click on the Location: X:153,409,698-153,424,507 tab.

The OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes are located next to each other on the X chromosome, while the OPN1SW (blue opsin) gene is located on chromosome 7. As females have two X chromosomes, a normal gene on one chromosome can often make up for a defective one on the other, whereas males, with a single X chromosome, cannot make up for a defective gene. Thus, red-green colour blindness is much more prevalent in males than in females. Variation in the genes for red and green opsin can cause subtle differences in colour perception, while tandem rearrangements due to unequal crossing-over between these genes cause more serious defects in colour vision.

(d) Click on the Gene: OPN1LW tab. Click on Comparative Genomics - Gene tree (image) in the side menu. Click on View options: View paralogs of current gene below the gene tree image. Click on the nodes (red squares) for the duplication events that have given rise to the various paralogues.

A duplication event at the level of the Catarrhini (Apes and Old World monkeys) has given rise to the OPN1LW (red opsin) and OPN1MW and OPN1MW2 (green opsin) genes. The other paralogues are due
to earlier duplication events. This agrees with the fact that the green opsins show the highest sequence similarity to red opsin (see question b) and the fact that the genes for the red and green opsins are located close to each other on the genome (see question c).

Note: On the Paralogues page nine paralogues are shown (see question b). Five of these are of the type other paralogue. These are paralogues that are too distant to be in the same gene tree, but can still be related as part of a broader super-family. Therefore, the gene tree for the OPN1LW gene only shows four of its nine paralogues. The precise taxonomic level of duplication for the other paralogues is left undetermined.

(e) Click on the speciation node (blue square) that is at the base of the complete gene tree. Click on Expand for Jalview in the pop-up menu (which should say Taxon: Chordates). Click [Start Jalview]. Close the pop-up window with the gene tree. Click on Select > Select all on the menu bar of the pop-up window with the protein sequence alignment. Click on Calculate > Sort > by ID on the menu bar. Select the protein sequences of the human paralogues. Click on Select > Invert Sequence Selection on the menu bar. Click on Edit > Delete on the menu bar.

As the alignment is based on the complete set of protein sequences in the gene tree, the alignment of this subset of five proteins will contain empty columns. To remove these, click on Edit > Remove Empty Columns on the menu bar.

Exercise 5 BioMart

Go to the Ensembl homepage (http://www.ensembl.org/). Click on the BioMart link on the toolbar.

Start with all human Ensembl genes: choose the Ensembl Genes 73 database. Choose the Homo sapiens genes (GRCh37.p12) dataset.

Now, filter for the genes on the Y chromosome: click on Filters in the left panel. Expand the REGION section by clicking on the + box. Select Chromosome Y.
Make sure the check box in front of the filter is ticked, otherwise the filter won't work. Click the [Count] button on the toolbar. This should give you 506 / 63605 Genes.

Now filter further for genes that are protein-coding: expand the GENE section by clicking on the + box. Select Gene type protein_coding. Click the [Count] button on the toolbar. This should give you 54 / 63605 Genes.

Finally, filter for genes that encode proteins containing one or more transmembrane domains: expand the PROTEIN DOMAINS section by clicking on the + box. Select Transmembrane domains Only. Click the [Count] button on the toolbar. This should give you 4 / 63605 Genes.

Specify the attributes to be included in the output (note that a number of attributes will already be selected by default): click on Attributes in the left panel. Expand the GENE section by clicking on the + box. In addition to the attributes Ensembl Gene ID and Ensembl Transcript ID that are already selected, select for instance Associated Gene Name and Description.

Have a look at a preview of the results (only 10 rows of the results will be shown): click the [Results] button on the toolbar.

If you are happy with how the results look in the preview, output all the results: select View All rows as HTML or export all results to a file.

Note: When you select View All rows as HTML, your results will be shown under a new tab or in a new window in your Internet browser.
Although you have filtered for only four genes, your results will contain more than four rows. This is because several of the genes have more than one transcript that encodes a protein containing one or more transmembrane domains, and consequently the results contain a separate row for each of these transcripts.
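
The relationship between result rows and genes can be sketched in Python as follows. This is only an illustration under stated assumptions: the rows mimic the shape of an export with Ensembl Gene ID and Ensembl Transcript ID columns, but the identifiers (GENE_A, TRANSCRIPT_A1 and so on) are hypothetical placeholders, not real BioMart output, and the function name transcripts_per_gene is made up for this example.

```python
from collections import defaultdict

# Hypothetical rows shaped like a BioMart export: (gene ID, transcript ID).
# These identifiers are placeholders, not real Ensembl IDs.
rows = [
    ("GENE_A", "TRANSCRIPT_A1"),
    ("GENE_A", "TRANSCRIPT_A2"),
    ("GENE_B", "TRANSCRIPT_B1"),
]


def transcripts_per_gene(rows):
    """Group transcript IDs by the gene they belong to."""
    grouped = defaultdict(list)
    for gene_id, transcript_id in rows:
        grouped[gene_id].append(transcript_id)
    return dict(grouped)


grouped = transcripts_per_gene(rows)

# One row per transcript, so the row count can exceed the gene count.
print(len(rows), "rows for", len(grouped), "genes")
```

Applied to a real export saved to a file, the same grouping would show which of the four genes contribute more than one row.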