Orthologue
Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Orthologs normally retain the same function in the course of evolution, so identifying them is critical for reliable prediction of gene function in newly sequenced genomes.

Homolog
A homolog is a gene related to a second gene by descent from a common ancestral DNA sequence. The term may apply to the relationship between genes separated by a speciation event (see ortholog) or to the relationship between genes separated by a gene duplication event (see paralog).

Paralog
Paralogs are genes related by duplication within a genome. Orthologs retain the same function in the course of evolution, whereas paralogs evolve new functions, even if these are related to the original one.

The functions of human genes and other DNA regions are often revealed by studying their parallels in nonhumans. To enable such comparisons, HGP researchers have obtained complete genomic sequences for the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the roundworm Caenorhabditis elegans, the fruitfly Drosophila melanogaster, the laboratory mouse, and many other organisms. The availability of complete genome sequences generated both inside and outside the HGP is driving a major breakthrough in fundamental biology as scientists compare entire genomes to gain new insights into evolutionary, biochemical, genetic, metabolic, and physiological pathways. HGP planners stress the need for a sustainable sequencing capacity to facilitate future comparisons.

What is Comparative Genomics
Comparative genomics is the analysis and comparison of genomes from different species. The purpose is to gain a better understanding of how species have evolved and to determine the function of genes and noncoding regions of the genome. Researchers have learned a great deal about the function of human genes by examining their counterparts in simpler model organisms such as the mouse.
Genome researchers look at many different features when comparing genomes: sequence similarity, gene location, the length and number of coding regions (called exons) within genes, the amount of noncoding DNA in each genome, and highly conserved regions maintained in organisms as simple as bacteria and as complex as humans.

Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them. Some of these sequence-similarity tools are accessible to the public over the Internet. One of the most widely used is BLAST, which is available from the National Center for Biotechnology Information.
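To make "looking for regions of similarity" concrete, here is a minimal sketch that scores percent identity in sliding windows along two sequences. It assumes the sequences have already been aligned to equal length by some alignment tool, with '-' marking gaps. The function names window_identity and conserved_windows, the toy sequences, and the 70% threshold are illustrative assumptions, not part of BLAST or any other tool named here.

```python
def window_identity(aligned_a, aligned_b, window=10, step=5):
    """Percent identity of two pre-aligned sequences in sliding windows.

    Both inputs must have the same length; '-' marks an alignment gap.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        a = aligned_a[start:start + window]
        b = aligned_b[start:start + window]
        # A column counts as a match only if both bases agree and are not gaps.
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


def conserved_windows(aligned_a, aligned_b, window=10, step=5, threshold=70.0):
    """Keep only the windows whose identity reaches the (assumed) threshold."""
    return [(start, pct)
            for start, pct in window_identity(aligned_a, aligned_b, window, step)
            if pct >= threshold]


if __name__ == "__main__":
    # Toy aligned sequences: the first half is identical, the second half differs.
    seq_a = "AAAAACCCCC"
    seq_b = "AAAAAGGGGG"
    print(conserved_windows(seq_a, seq_b, window=5, step=5))
```

Real genome comparison tools work on far longer sequences and must compute the alignment itself; this sketch only illustrates the scoring step that turns an alignment into a list of candidate conserved regions.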
Goals
Complete the sequence of the roundworm C. elegans genome by 1998.
Complete the sequence of the fruitfly Drosophila genome by 2002.
Develop an integrated physical and genetic map for the mouse, generate additional mouse cDNA resources, and complete the sequence of the mouse genome by 2008.
Identify other useful model organisms and support appropriate genomic studies.

The complete DNA sequence of the human genome is a remarkable achievement for molecular biology and represents the work of many people in a number of large sequencing centers. Far from resting on their laurels, those centers have gone on to sequence the genomes of the mouse, rat, pufferfish, zebrafish, chicken, chimpanzee... you name it, they're sequencing it. Why this drive to sequence every animal in the zoo? Do we really care about the genetics of pufferfish? In isolation, not so much, but comparisons with the other genomes yield tremendous insights into the genes that are essential for life and those that define the species. They reveal the mechanisms of evolution and the hidden mechanisms of gene regulation.

Geographic maps are a useful analogy for how we study genomes. If you were given a detailed map of London, you could learn a lot about what defines a large cosmopolitan city. You would see a large number of apartments, shops, and restaurants and might reasonably conclude that these are essential for life in the city. But you could not assess the relative importance of unique features like Buckingham Palace or the Brick Lane street market. Things would be clearer if you were also given a detailed map of Paris. That too has apartments, shops, and restaurants, confirming your earlier hypothesis. It also has street markets, so perhaps those are an important, albeit secondary, aspect of city life. In contrast, Paris has no "active" royal palaces. Why not? One interpretation might be that Buckingham Palace is an important feature that distinguishes London from other cities.
Another might be that a royal family has no function whatsoever in a modern society and survives in London merely as an evolutionary remnant.

A genome sequence studied in isolation poses the same kind of puzzle. Comparing the sequence to a second genome can answer many of these questions: we can compare one with the other, locate conserved sequence segments, and assess their significance. The more genomes we have, the more confident we can become of our assignments and the higher the "resolution" at which we can examine the subtleties.
Synteny
Evolution never makes things simple for biologists. We can't just line up the mouse and human genomes starting at one end of a chromosome and expect to find matching regions one after another. On the time scale of evolution, the process of recombination (the genetic equivalent of cut-and-paste) is continually at work rearranging the genome. Large blocks of genes are moved around within, and between, genomes.

The Software: Genome Browsers
To explore comparative genomics we will use the VISTA Genome Browser from Ed Rubin's group at Lawrence Berkeley National Laboratory (LBNL) in Berkeley, Calif.
LAGAN and Multi-LAGAN
Glocal alignment
http://www.tigr.org/software/

References
[1] Couronne O, Poliakov A, Bray N, Ishkhanov T, Ryaboy D, Rubin E, Pachter L, Dubchak I. Strategies and tools for whole-genome alignments. Genome Research 2003 Jan;13(1):73-80.
[2] Brudno M, Do CB, Cooper GM, Kim MF, Davydov E, Green ED, Sidow A, Batzoglou S. LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Research 2003 Apr;13(4):721-31.
[3] Brudno M, Malde S, Poliakov A, Do CB, Couronne O, Dubchak I, Batzoglou S. Glocal alignment: finding rearrangements during alignment. Bioinformatics 2003;19(Suppl. 1):i54-i62 (Special Issue on the Proceedings of ISMB 2003).

Comparative Genome Databases
http://www.hgmp.mrc.ac.uk/genomeweb/comp-gen-db.html

Like other scientific discoveries, genes are named by the person who discovers them. Technically, a scientist can name a gene anything he or she wants. Some scientists choose names based on the disorder thought to be associated with changes in the gene. For example, changes in the CFTR gene cause cystic fibrosis.
Some genes are named with abbreviations
After reading about WNT2, RELN, HOXA1, OXTR, and others, you may wonder how genes are named. As you may have guessed, these names are abbreviations for the full gene names, which is especially useful for genes with long names. WNT2, for example, is short for "wingless-type MMTV integration site family member 2". While "wingless" seems an unnecessary adjective (of course humans don't have wings!), some genes are named after similar genes in other organisms, such as fruit flies.

Names based on a disorder are often abbreviations too: CFTR, the gene in which changes cause cystic fibrosis, stands for "cystic fibrosis transmembrane conductance regulator".

Sometimes the gene name is a variation of the name of the protein the gene makes. For example, the RELN gene contains instructions for making the reelin protein, which was named for the "reeling" walking motion of mice that have changes in their own version of the RELN gene!

Other genes are named based on their functions. For example, HOX genes (short for homeobox) are a whole group of genes involved in development. Individual HOX genes are named with additional letters and numbers, such as HOXA1 or HOXD9. Because of this consistent naming system, we know that all HOX genes play a specific type of role in development.

There are even playful gene names, such as the SHH gene (sonic hedgehog), which is involved in the development of the brain, spinal cord, and limbs. The SHH gene is named after Sonic the Hedgehog!
International BCB-Workshop on Gene Annotation Analysis and Alternative Splicing
http://www.medizin.fu-berlin.de/molbiochem/bioinf/konferenz_04/Start.html

Candidate gene
A candidate gene is a gene that researchers think may be related to a particular disease or condition. Researchers find candidate genes in a variety of ways, but candidate genes in general fall into two categories: positional or functional.

Positional candidate genes
A positional candidate gene is one that researchers think may be associated with a disorder based on the gene's location on a chromosome.

Functional candidate genes
Researchers sometimes look at candidate genes that make products that may have something in common, medically or biologically, with the disorder they are studying. In practice, scientists often identify functional candidate genes by looking for genes whose expression correlates with the traits under study.

Methods for comparative mapping
1. Build linkage maps of many known genes, then compare the genes within each linkage group across species for similar genes and gene orders.
2. Physically analyze a large segment of DNA containing known genes in human or another species, and check whether the genes, their order, and their orientation are the same.
3. Find a group of genes located in the same region in human, then conduct an in silico Southern blot analysis to determine whether the same genes are organized in similar regions in, e.g., pufferfish or zebrafish.
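The comparison of genes, gene order, and orientation described in these methods can be sketched in a few lines of code. The example below is only an illustration under stated assumptions: each region is given as a list of (gene, strand) pairs, gene names are unique within a region, orthologous genes share the same name in both species, and genes found in only one region are ignored. The function name conserved_blocks and the gene labels A to F and X are hypothetical toy data, not real map positions.

```python
def conserved_blocks(region_a, region_b, min_len=2):
    """Find runs of shared genes that keep their order in both regions.

    Each region is a list of (gene, strand) tuples, strand being '+' or '-'.
    Returns a list of (kind, genes) where kind is 'same' for a block in the
    same order and orientation, or 'inverted' for a block whose order is
    reversed and whose strands are flipped.
    """
    names_a = {gene for gene, _ in region_a}
    # Index region_b using only the genes that also occur in region_a.
    shared_b = [(gene, strand) for gene, strand in region_b if gene in names_a]
    pos_b = {gene: (i, strand) for i, (gene, strand) in enumerate(shared_b)}

    blocks, current = [], []
    prev_idx = prev_rel = None
    for gene, strand_a in region_a:
        if gene not in pos_b:
            continue  # gene absent from region_b
        idx_b, strand_b = pos_b[gene]
        # +1 if the gene keeps its strand, -1 if it is flipped.
        rel = 1 if strand_a == strand_b else -1
        # A block continues if the orientation relation is unchanged and the
        # position in region_b moves one step in the matching direction.
        extends = bool(current) and rel == prev_rel and idx_b - prev_idx == rel
        if not extends:
            if len(current) >= min_len:
                blocks.append(("same" if prev_rel == 1 else "inverted", current))
            current = []
        current.append(gene)
        prev_idx, prev_rel = idx_b, rel
    if len(current) >= min_len:
        blocks.append(("same" if prev_rel == 1 else "inverted", current))
    return blocks


if __name__ == "__main__":
    # Toy regions: A-B-C is collinear, D-E-F is inverted in the second species.
    human = [("A", "+"), ("B", "+"), ("C", "-"), ("D", "+"), ("E", "+"), ("F", "-")]
    fish = [("A", "+"), ("B", "+"), ("X", "+"), ("C", "-"),
            ("F", "+"), ("E", "-"), ("D", "-")]
    print(conserved_blocks(human, fish))
```

Ignoring genes that occur in only one region is a deliberate simplification, so that a single gene gain or loss does not split an otherwise conserved block; a stricter comparison would treat such genes as breakpoints.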