Comparative genomics and proteomics

Species available

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Vertebrates: human, chimpanzee, mouse, rat, chicken, puffer fish, zebrafish, Tetraodon
Arthropods: the mosquito Anopheles gambiae, Drosophila melanogaster, honeybee
Nematodes: Caenorhabditis elegans, C. briggsae

Reach the home page for each species via the generic Ensembl home page (http://www.ensembl.org), or bookmark a species home page with a URL like http://www.ensembl.org/rattus_norvegicus. For species that have an assembly but are still being annotated, a Preview browser (Pre!) lets you look at new assemblies or new species, for example http://pre.ensembl.org/bos_taurus/.

[Figure: Rat home page]

Additional animal genomes will be incorporated in the future, with the emphasis on those important for biomedical research or for evolutionary comparisons. At the time of writing, sequencing of the following genomes is underway, and some of these will appear in Ensembl during 2005:

Vertebrates: Rhesus macaque, opossum and Xenopus
Arthropods: Aedes mosquito

A key element of any genome browser is the display of genes in their chromosomal locations. For most species, Ensembl runs an automated sequence annotation pipeline and gene build to provide annotation, including genome-wide gene and protein sets. Building a comprehensive gene set poses different challenges in different organisms. The Ensembl gene building process is discussed further later in the workshop.

For species where the research community is generating comprehensive manual annotation, Ensembl incorporates those gene and protein sets instead of, or in addition to, its own automated annotation. Thus, manual annotation is displayed for some human chromosomes alongside the Ensembl predictions, and the manually curated genome-wide gene sets for D. melanogaster and C. elegans are used in place of an Ensembl set.

The additional types of annotation available vary to some extent between species. But because annotation is stored and displayed in a consistent way for all species, your experience of working with one species transfers to new species, and comparisons of genomic sequence and of homologous genes and proteins between species are made easier.

Orthologues

One kind of comparative analysis focuses on genes and proteins, and attempts to identify the orthologue (the same gene) in different genomes. Apart from the value of such data in evolutionary studies, it is very useful to be able to identify the equivalent of important genes in one organism (for example, human disease genes) in other organisms that provide experimentally tractable models for studying that gene. The classic model animals (Drosophila, C. elegans, mouse) are now all represented in Ensembl, as is zebrafish, and it is hoped to include other useful models as they become available.
The automated identification of orthologues is made more difficult by the existence of families of closely related genes. Under such circumstances, Ensembl may show more than one potential orthologue, and the results need to be treated with caution. Of course, there may really be more than one orthologue when a lineage has generated additional family members by duplication after the divergence of the two organisms under consideration. See the section below on synteny blocks for one way that Ensembl can help you to assess the orthologue pairs.

In Ensembl, orthologues are identified starting with comparisons at the protein level. All-versus-all BLASTP+SW (Smith-Waterman algorithm) is first used to identify those protein pairs that are best reciprocal hits (BRH) between two sets of proteins that represent every gene in the two organisms. Additional putative orthologues are then sought using synteny information together with the best reciprocal hits; these are labelled RHS, for Reciprocal Hit supported by Synteny. Where two homologous proteins are encoded by genes that each lie within 1 Mb of a BRH pair, they are good candidates for being an additional orthologous pair. Currently we divide the BRH into UBRH (Unique Best Reciprocal Hit) and MBRH (Multiple Best Reciprocal Hit). The latter applies when there are multiple but identical best hits, as can happen if there is perfect protein sequence duplication of translated genes within a species. The same approach permits the identification of adjacent family members that may be recently duplicated lineage-specific paralogues.

Ensembl shows the information about potential orthologues on each GeneView page (and also in SyntenyView displays of synteny blocks, where these are available). The procedure has been applied to all pairs of vertebrates within Ensembl, to the two nematodes, and to the two insects.

[Figure: Parts of a GeneView page, showing putative orthologues]

EnsMart lets you access and use this orthology information in a variety of ways. A set of genes can be selected that have identified orthologues in another species, and further restricted to those that share conserved upstream sequence. For any set of genes, output can include the details of any orthologues in other species and the locations of conserved upstream sequence.

Protein families

What if the orthologue identification procedure fails to find a pair? And what if you are interested in looking at a wider set of potential orthologues and paralogues between two species or within a wider range of organisms?

One option is to look for proteins that share particular domains. Ensembl runs domain prediction programs on all its protein sets, and provides access to this information in ProteinView (for individual proteins) and in DomainView (showing all the genes in a species that share a particular InterPro domain).

However, some domains are shared by a wide range of proteins that have very different functions. Ensembl's protein families are an attempt to identify clusters of functionally related proteins, among which one might expect to find most orthologues and paralogues. The family database is generated by running the Tribe-MCL sequence clustering algorithm on a set of peptides consisting of the Ensembl predictions for each Ensembl species, together with all metazoan sequences from Swiss-Prot and SPTrEMBL.
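Returning briefly to the orthologue procedure in the Orthologues section above: the best-reciprocal-hit step can be illustrated with a minimal Python sketch. It is not the Ensembl pipeline. The function names, the protein identifiers and the scores are all hypothetical, the scores stand in for whatever alignment score the BLASTP+SW step reports, and ties (the MBRH case) and the synteny-supported RHS step are deliberately left out.

```python
from typing import Dict, List, Tuple

# Hypothetical input type: (query protein, target protein) -> alignment score.
Scores = Dict[Tuple[str, str], float]


def best_hits(scores: Scores) -> Dict[str, str]:
    """Return, for each query protein, its single highest-scoring target."""
    best: Dict[str, Tuple[str, float]] = {}
    for (query, target), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def reciprocal_best_hits(a_vs_b: Scores, b_vs_a: Scores) -> List[Tuple[str, str]]:
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


if __name__ == "__main__":
    # Made-up scores for three "human" and two "mouse" proteins.
    human_vs_mouse = {
        ("hsA", "mmA"): 950.0,
        ("hsA", "mmB"): 400.0,
        ("hsB", "mmB"): 800.0,
        ("hsC", "mmB"): 780.0,
    }
    mouse_vs_human = {
        ("mmA", "hsA"): 940.0,
        ("mmB", "hsB"): 810.0,
        ("mmB", "hsC"): 770.0,
    }
    print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

In this toy data hsC has mmB as its best hit, but mmB's best hit is hsB, so hsC is left without a reciprocal partner. This is the kind of case in which synteny information or family membership becomes useful.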
On this set of peptides, an all-against-all BLASTP is run to establish similarities. Using these similarities, clusters are then established with the MCL algorithm. [For more detail of the underlying methods, see Enright, A.J. et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res. 30, 1575-1584.] The efficiency of this approach permits clustering to be done within a realistic time despite the very large numbers of proteins involved (for a recent release, around half a million protein sequences were loaded and processed in less than 24 hours, using 400 CPUs).

[Figure: Domain and Family information on a ProteinView page]

Both family and domain information are shown in Ensembl GeneView and ProteinView pages, and DomainView and FamilyView pages make it easy to examine all the identified genes and proteins within a species. For families, you can also see the family members in other Ensembl species and the UniProt entries (from all metazoans) that fall in the same cluster. In the future, we plan to introduce a similar multi-species display for protein domains.

EnsMart provides the means to rapidly and easily download sets of transcript or protein sequences with particular domains or from particular families, which can be very useful as starting points for alignment and phylogenetic analysis. In addition, the Ensembl database stores pre-calculated protein alignments for all members of a family, and these alignments can be displayed in JalView.
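To give a feel for the clustering step used to build the protein families, here is a toy pure-Python sketch of the basic MCL iteration (expansion followed by inflation) on a small similarity matrix. It is only an illustration under assumptions. The function names, the inflation value, the fixed iteration count and the example weights are all made up, and the real Tribe-MCL pipeline derives its weights from BLASTP results and uses an optimised implementation.

```python
from typing import List

Matrix = List[List[float]]


def normalise_columns(m: Matrix) -> Matrix:
    """Scale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def multiply(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mcl(similarity: Matrix, inflation: float = 2.0, iterations: int = 50) -> List[List[int]]:
    """Cluster node indices from a symmetric, non-negative similarity matrix."""
    n = len(similarity)
    # Add self-loops so that no column is empty, then normalise.
    m = [[similarity[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalise_columns(m)
    for _ in range(iterations):
        m = multiply(m, m)                                 # expansion
        m = [[x ** inflation for x in row] for row in m]   # inflation
        m = normalise_columns(m)
    # Rows that keep non-zero entries act as attractors; the columns they
    # reach form one cluster.
    clusters = set()
    for row in m:
        members = frozenset(j for j, x in enumerate(row) if x > 1e-6)
        if members:
            clusters.add(members)
    return sorted(sorted(c) for c in clusters)


if __name__ == "__main__":
    # Hypothetical similarities between five peptides (indices 0 to 4).
    sim = [
        [0.0, 0.9, 0.8, 0.0, 0.0],
        [0.9, 0.0, 0.7, 0.0, 0.0],
        [0.8, 0.7, 0.0, 0.1, 0.0],
        [0.0, 0.0, 0.1, 0.0, 0.9],
        [0.0, 0.0, 0.0, 0.9, 0.0],
    ]
    print(mcl(sim))
```

Raising the inflation parameter tends to give smaller, tighter clusters, while lowering it tends to merge them. This trade-off is the main knob in this family of methods.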

[Figure: Part of a family alignment in JalView]

Whole genome DNA-DNA alignments

The alignment of the whole DNA sequence from two organisms is computationally demanding, and the algorithms to carry it out are under active development [see for example Ureta-Vidal, A. et al. (2003) Comparative genomics: genome-wide analysis in metazoan eukaryotes. Nat. Rev. Genet. 4, 251-262]. Such data are of great interest both in studies of the mechanisms of molecular evolution and in attempts to identify conserved functional sequences such as novel genes and regulatory regions.

Whole genome alignments become increasingly difficult as the evolutionary distance between two organisms increases. At present, Ensembl displays pair-wise alignments within mammals (human, mouse and rat), and within nematodes (C. elegans and C. briggsae); within these species groups, separation probably occurred <100 Mya.

Ensembl is experimenting with different procedures to do the alignments. At present, conserved regions are identified either by BLASTz (data obtained from the UCSC Genome Bioinformatics group) or by Phusion/BLASTn (used for C. elegans and C. briggsae). The latter first runs the two genomes through Phusion, which takes unique 17mers from one genome and compares them to the second genome, creating clusters of contigs from both genomes. It then compares contigs within clusters using Washington University's version of BLASTn (without repeat masking). The output of wublastn is postprocessed to keep only high-scoring pairs and to identify diagonals of blast alignments. Regions of these alignments that represent highly conserved regions are then selected using a filtering method devised by Jim Kent [see Schwartz, S. et al. (2003) Human-Mouse Alignments with BLASTZ. Genome Res. 13, 103-107].

A third method, translated BLAT, is used to compare genomes from more evolutionarily distant species at the amino acid level. Regions of similarity found this way will therefore be biased towards those that code for proteins, although highly conserved non-coding regions might be detected as well.

From the Compara menu in ContigView you can show a number of tracks (e.g. human vs. chimpanzee, human vs. rat, human vs. mouse, human vs. chicken, human vs. Fugu, human vs. zebrafish, human vs. Tetraodon, and/or Caenorhabditis elegans vs. C. briggsae) displaying the conservation within each cluster of species (vertebrates, arthropods and nematodes). Each comparison shows two levels of conservation (labelled cons for BLASTz or Phusion/BLAST comparisons and high cons for highly conserved). Links make it easy to navigate back and forth to see details of the region in the two genomes and to download the sequence of regions of interest.

[Figure: Part of a human ContigView detailed display panel, showing whole genome alignments with mouse and rat]

Further access to the conserved sequences at Ensembl is provided via DotterView displays of local alignments, while EnsMart provides a route for identifying specifically those highly conserved sequences that are located in regions upstream of pairs of orthologous genes.

[Figure: DotterView display showing two homologous exons]

Synteny blocks

The identification of segments of the genome where the order of particular genes is conserved between two species (synteny blocks) is of interest not only for studying the evolution of chromosome structure, but also for helping to predict and identify pairs of genes between species that are (or are not!) orthologues. Where candidate orthologues in two species are found to be located within well-conserved synteny blocks, you can have more confidence that the pair have been correctly labelled as orthologues.

Ensembl finds synteny blocks by grouping the conserved regions identified from genomic alignments. To be grouped, matches must represent the same relative orientation of chromosome sequences and must be separated by less than 100 kb. Groups that make up a block of less than 100 kb are discarded (parameters may be varied for different species). The approach requires that the species are close enough for genomic alignments to be attempted.

[Figure: Part of a human CytoView display of human chromosome 11, showing synteny blocks conserved on chimpanzee, mouse, rat and chicken]

Ensembl shows the synteny blocks in ContigView (overview panel) and CytoView displays for mammalian and nematode comparisons. In CytoView only, the blocks provide links to display that region in the other species. In addition, SyntenyView shows all blocks on a whole chromosome, related to the conserved blocks in a second species. (SyntenyView is not available for the C. elegans - C. briggsae comparison, as the C. briggsae genome sequence has not yet been assembled into chromosomes.) Links provide navigation between blocks and between species, and the display also shows genes in the current block together with their putative orthologues in the second species.
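The grouping rule for synteny blocks described above (same relative orientation, gaps under 100 kb, blocks under 100 kb discarded) can be sketched in a few lines of Python. This is a simplified reading of the rule, not the Ensembl implementation. The Match record, its field names and the choice to apply the gap limit in both genomes and the length limit in the first genome only are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Match:
    """One conserved region on a chromosome of species 1 (hypothetical record)."""
    start1: int       # position in species 1
    end1: int
    chrom2: str       # chromosome in species 2
    start2: int       # position in species 2
    end2: int
    orientation: int  # +1 same strand, -1 opposite strand


def gap(a_start: int, a_end: int, b_start: int, b_end: int) -> int:
    """Distance between two intervals, or 0 if they overlap."""
    return max(0, max(a_start, b_start) - min(a_end, b_end))


def synteny_blocks(matches: List[Match], max_gap: int = 100_000,
                   min_block: int = 100_000) -> List[Tuple[int, int, str, int, int]]:
    """Group neighbouring matches, then drop groups spanning less than min_block.

    Returns (start1, end1, chrom2, orientation, number_of_matches) per block.
    """
    groups: List[List[Match]] = []
    current: List[Match] = []
    for m in sorted(matches, key=lambda x: x.start1):
        if current:
            prev = current[-1]
            joinable = (
                m.chrom2 == prev.chrom2
                and m.orientation == prev.orientation
                and gap(prev.start1, prev.end1, m.start1, m.end1) < max_gap
                and gap(prev.start2, prev.end2, m.start2, m.end2) < max_gap
            )
            if not joinable:
                groups.append(current)
                current = []
        current.append(m)
    if current:
        groups.append(current)

    blocks = []
    for g in groups:
        start = min(m.start1 for m in g)
        end = max(m.end1 for m in g)
        if end - start >= min_block:
            blocks.append((start, end, g[0].chrom2, g[0].orientation, len(g)))
    return blocks


if __name__ == "__main__":
    # Made-up matches along one chromosome of species 1.
    example = [
        Match(0, 50_000, "chr7", 1_000_000, 1_050_000, 1),
        Match(90_000, 140_000, "chr7", 1_090_000, 1_140_000, 1),
        Match(400_000, 420_000, "chr7", 1_400_000, 1_420_000, 1),
        Match(430_000, 600_000, "chr2", 5_000_000, 5_170_000, -1),
    ]
    for block in synteny_blocks(example):
        print(block)
```

In the made-up data the first two matches merge into one 140 kb block, the isolated 20 kb match is discarded as too short, and the last match forms its own block on a different chromosome of the second species.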

Comparative genomics and proteomics in Ensembl - examples

a) From human gene to putative mouse orthologue and to Ensembl protein family

[Figure: Human CFTR GeneView. Labels: Link to mouse homologue; Link to FamilyView (human)]

b) FamilyView

[Figure: FamilyView. Labels: Alignments in JalView; Family members in home; Family members in other species]

c) SyntenyView

[Figure: Human chromosome 8 surrounded by mouse chromosomes with conserved synteny blocks. Label: Human genes in the selected block, together with their putative mouse orthologues]

Comparative genomics and proteomics in Ensembl - exercises

Main exercise: Explore a protein family in human, mouse and rat, identify putative orthologues, and explore regions of conserved synteny.

1. Find the GeneView page for human SNX5 (Ensembl gene), and scroll down to the first Transcript/Translation Summary.

2. Take the link to the associated Protein Family.
How many human genes produce proteins in this family? Are they all known genes?
Are there members of the same family in mouse, rat and zebrafish? How many? What about invertebrate species?
Click on one of the rat peptides and go to rat ProteinView. From there take the link to the corresponding rat FamilyView. How many rat genes are part of this family?
Find your way to mouse FamilyView, and follow the link to mouse Sorting Nexin 5 (GeneView). Have a look at the section Orthologue Predictions. Follow the link to human SNX5 (this takes you back to where you started).

3. Examine the genomic context of the human and mouse genes.
From human SNX5 GeneView, follow the link View gene in genomic location to ContigView. Which chromosomal region is the human gene located in?
Customise the display of ContigView. Select only Ensembl Trans., mouse (Mm) cons. and high cons. and rat (Rn) cons. and high cons.; deselect all other options.
Have a look at the mouse and rat conserved regions in relation to the human Ensembl transcript. Note the correspondence with exons, but note also that this is not perfect. Zoom in to examine in more detail.
The conserved regions are probably shown grouped (a red - appears to the left of the track label). Note that pointing to a region produces a pop-up with details of, and a link to, that region in the other species.
Click on the red - to the left of the Mm cons. track: ContigView will reload, a red + replaces the -, and the hits are now ungrouped.
Point to a mouse match in this track, and take the link to DotterView. Note the dots on the diagonals where exons align. Zoom in to examine a smaller region.
Go back to human ContigView, point to a mouse match, and this time take the link Jump to Mus musculus. This takes you to the corresponding display in mouse ContigView. In which chromosomal region is the gene located?
Zoom and/or customise the ContigView display to focus on the mouse Snx5 transcript, and turn on the rat and human matches tracks if necessary. Compare the amount of sequence showing as matched (the same threshold Blast score is used).