Comparative genomics Lucy Skrabanek ICB, WMC 6 May 2008
What does it encompass? Genome conservation transfer knowledge gained from model organisms to non-model organisms Genome evolution understand how genomes change over time in order to identify evolutionary processes and constraints Genome variation understand how genomes vary within a species to identify genes central to particular processes
Main uses Whole genome comparisons Genome evolution Coding regions comparisons Gene prediction Gene structure (exon-intron) prediction Function prediction Non-coding region comparisons Regulatory region discovery Protein-protein interaction prediction
Metazoan evolutionary tree Ureta-Vidal et al, NRG 2003
Rearrangement rate Can estimate the number of rearrangements from cytological comparisons Two very different rates: Very slow rate of rearrangement (1 or 2 exchanges per 10 MYR) ~ 7 rearrangements between the human genome from the hypothetical primate ancestor 13 rearrangements between cat and human
Rearrangement rate Punctuated by abrupt global genome rearrangement in some lineages Gibbons and siamangs rearranged 3 or 4 times more extensively than human or other great apes Rodent species exhibit very rapid patterns of chromosome change ~180 conserved segments between mouse and human ~100 conserved segments between rat and human
Human, cat and mouse X chromosome comparison O Brien et al, Science 1999
Genome comparison across species gross changes in chromosome number O Brien et al, Science 1999
Genome alignment Very different from single gene or protein alignments Standard DPAs are too expensive Made complicated by extensive rearrangements of large homologous segments
Problems Looking for syntenic regions Rearrangements disrupting syntenic regions Insertions Deletions Inversions Translocations Duplications
Assumptions The two genomes to be aligned derived from a common ancestor There remains sufficient similarity between the genomes to enable easy identification of homologous regions For the alignment to be informative, there has to have been time for the genomes to diverge and for selection to have occurred
Algorithm requirements Genome alignment algorithms must Scale linearly (computationally) Be robust (not too many parameters) Be memory-efficient Be able to handle rearrangements, gene duplications, repetitive elements Smith-Waterman, Needleman Wunsch Time to do calculation of the order of (O) 4 Not feasible for sequence length > 10,000 bp Cannot handle rearrangements or inversions
Alignment methods Seeding methods (e.g., BLASTZ, BLAT, EXONERATE) Produce local alignments All matches found (including all paralogs) Very sensitive, not very specific Very fast Anchor-based methods (e.g., MUMmer, GLASS, AVID) Produce global alignments Specific Difficulty with rearrangements Ureta-Vidal et al, Nat Rev Genet April 2003
Local aligner: BLASTZ As in BLAST, find seeds Seeds are determined by a 12of19 weightedspaced seeds strategy where the positions for strict matches are specified 1110100110010101111 combination shown to be the most sensitive (and more sensitive than the 11-consecutive match seed strategy used by BLAST) Also allows one transition in 1/12 strict match positions Seeds with many matches masked out (assumed to be repetitive regions)
BLASTZ Gap-free extension Further extend seeds by DPA, allowing gaps Low complexity regions have their scores specifically down-weighted Repeat above steps (using a more sensitive seed, e.g., 7-mer) for regions that lie between matches, that share the same order and orientation, and are separated by <50 kb Post-processing to remove multiple hits to the same region When aligning human and mouse genomes, can achieve 98% alignment of known coding regions
Local aligner: BLAT Two modes: untranslated and translated Untranslated mode performs poorly when conservation < 90% so translated mode usually used when aligning genomes Will more efficiently identify regions that are conserved for their coding ability rather than for the regulatory functions Seeds created by building an index of nonoverlapping 5-mers from one genome Frequent 5-mers and ambiguous sequences (repeats and low-complexity regions) excluded The other genome is chopped into <200kb chunks Comparisons are made between the indexed 5-mers and the chunks DPA applied both upstream and downstream of matches
Global aligner: AVID Assumes that no gene duplications, inversions or translocations have occurred Need a pre-processing step to identify syntenic regions - use BLAT hits, filtered for specificity Find maximally repeated matches in syntenic regions Matches are flanked on either side by mismatches Determine an anchor set Exclude all maximally repeated matches that are less than half the length of the longest match Allow only non-overlapping non-crossing matches Use clean matches first, then repeat matches
Bray et al, Genome Res 2003 Blue + Red: all maximally repeated matches Red: anchor set n anchors n+1 regions to be aligned Add any matches that were discarded because they were too short on the first round Add any new anchors When there are no more anchors to be added, use the NW algorithm to align the intervening sequence If the regions are longer than ~ 4kb, the alignment is returned with a gap in both sequences at that position
Global aligner: LAGAN Different matching algorithm to AVID Generates local alignments Allows mismatches Find highest weight chain Compute NW alignment, limiting it to area around chains MLAGAN - multiple global alignment tool
Global aligner: WABA Takes into account degeneracy of genetic code Seeding step similar to BLAT Uses the two weighted-spaced seed 6of8 11011011, where the third position is allowed to mismatch In the alignment of homologous regions, uses a seven-state pair HMM which also allows mismatches in the third position Finally, join overlapping matches from the alignment phase
Visualization tools: VISTA vs. PipMaker VISTA (VISualization Tools for Alignments) Uses AVID to generate alignments Uses a sliding window approach Plots percent identity within a fixed window size, at regular intervals PipMaker (Percent Identity Plot) Uses BLASTZ X-axis is the reference sequence; horizontal lines represent gap-free alignments
Once we have a macro-alignment, we can study the evolution of genomes between species, and also can trace the evolutionary history of the structure of each genome itself Genome structure rearrangement Gene duplication, chromosomal duplication and polyploidization (whole genome duplication) New genes
Unravelling the history of the genome Plants Wheat Allohexaploid (AABBCC) Maize Diploidized allotetraploid Has grown 12x in the past 5 MY due to increased numbers of transposable elements Arabidopsis Haploid, 5N Yeast Saccharomyces cerevisiae vs. Kluyveromyces lactis Human Genome duplication?
Decipher history of lineages Genome size changes Compaction Large scale deletions - Fugu Intron loss Expansion Baxendale et al, Nature Genet, 1995 Duplication Transposable element insertions Transposons - Alus, LINEs Ancient insertions prior to eutherian radiation More recent insertions - maize
Why gen(om)e duplication? Duplicated genes provide a source for genetic novelty during evolution Either member of a duplicated gene pair can diverge to either Acquire a new function which may be positively selected for Subfunctionalize Be differently regulated (e.g., tissue-specific) Become a pseudogene Be deleted Whole genome duplication allows for the duplication (and subsequent divergence) of whole pathways at a time
Duplication events Resolution of polyploidy Non-disjunction of chromosomes (form multivalents instead of divalents) Sterility Duplicated genes do not start to diverge (or get deleted) until disomic inheritance resolved
Rearrangements Are genome rearrangements the cause or consequence of diploidization? Most widely accepted hypothesis is that diploidization proceeds by structural divergence of chromosomes Some loci appear disomic, others tetrasomic Stage 1: pairing between similar chromosomes allowed Loci near the centromeres can display residual tetrasomy Stage 2: non-homologous chromosome pairing resolved
Just how hard can it be to tell what happened? Take four, or maybe eight, decks of 52 playing cards. Shuffle them all together and then throw some cards away. Pick 20 cards at random and drop the rest on the floor. Give the 20 cards to some evolutionary biologists and ask them to figure out what you ve done. Skrabanek and Wolfe, Curr Opin Genet Dev, 1998 Now that the human genome has been sequenced, things aren t quite so bleak.
Effects of polyploidization Wolfe, Nature Rev Genet 2001
Effects of polyploidization Wolfe, Nature Rev Genet 2001
Saccharomyces cerevisiae (baker's yeast)
Whole Genome Duplication (WGD) WGD in S. cerevisiae: Wolfe & Shields, Nature 387:708 (1997)
Yeast Saccharomyces cerevisiae Degenerate tetraploid Polyploidy followed by extensive deletion and (70-100) reciprocal translocations 8% of genes duplicated in 55 blocks (plus many missed smaller blocks) Relative orientation of genes in blocks conserved with respect to the centromere Seoighe and Wolfe, PNAS, 1998
Duplicated blocks in yeast Wolfe and Shields, Nature 1997
Estimation of time of polyploidy event Diverged from Kluyveromyces lactis (unduplicated) ~150 MYA Comparison of gene sequences and gene order reveals conservation 59% of adjacent gene pairs in K. lactis or K. marxianus are also adjacent in S. cerevisiae 16% of Kluveromyces neighbors can be explained in terms of inferred ancestral gene order Phylogenetic analyses of duplicated genes where both the Kluveromyces orthologue and an outgroup orthologue were available, deduced that the polyploidization event in S. cerevisiae occurred around 100 MYA
5,000 genes 10,000 genes 10% retention 5,500 genes Different subsets retained 5,000 genes Evidence from conserved order of a very few genes Evidence from interleaving genes from sister segments Kellis et al Nature 428:617-624 (2004)
Each region of K.waltii matches two regions of S.cerevisiae We don t even need any remaining twocopy genes to infer the ancestral order
Human 1970 - Susumu Ohno proposed that vertebrate genomes had originated via genome duplication 2R (two Rounds [of genome duplication]) hypothesis One before, and one after divergence of agnathans (lamprey) from tetrapods (~430 MYA and ~500 MYA) Popular and controversial Split between the map-based people and the treebased people Duplicated regions are evident (covering 44% of the genome), but it is difficult to tell whether it is due to (a) genome duplication(s) or chromosomal duplications
Susumu Ohno 1928-2000 Book: Evolution by Gene Duplication (1970) Whole-genome duplication (polyploidy)
Timing of tetraploidy events Skrabanek, PhD thesis, 1999
Evidence? There are regions in the genome which are quadruplicated HSA 1, 6, 9 and 19 (MHC region, 10 gene families) HSA 4, 5, 8 and 10 HSA 2, 7, 12 and 17 (Hox clusters) Expect to see (A,B)(C,D) tree, where the time of divergence of A from B and C from D is approximately the same However, this is not consistently the case
Proposed evolution of the Hox clusters Skrabanek, PhD thesis, 1999
Possible routes to 4- gene families Hokamp et al, J Struct Func Genomics 2003
Estimates of divergence times for genes in the MHC region on chromosomes 1, 6, 9 and 19 Hughes, Mol Biol Evol 1998
Hughes et al, Genome Res, 2001 Divergence times of genes with members on at least two of the Hox cluster bearing chromosomes (2, 7, 12, 17)
Phylogenetic relationships of four- and threemembered gene families on Hox cluster bearing chromosomes Hughes et al, Genome Res, 2001
Conclusions Duplicated regions seen in human genome are most likely vertebrate specific Significant amount of duplication occurring ~350-650 MYA Possible to explain large margin by alloploidy? Probable support for at least one whole genome duplication event Once more genomes are available, such as Ciona intestinalis or amphioxus, it may be easier to decipher the history of duplication However, long time spans under consideration, and diploidization requires extensive genomic changes
PLOS Biology 3:e314 (2005)
Fish-specific genome duplication event Panopoulou and Poustka, TIG 21:559, 2005
Lander et al - Intl. Human Genome Sequencing Consortium paper Nature 409:860 (2001)
Hypothetical genomic region 1R decay 2R decay
Finding all homologs, and only homologs Find most similar vertebrate gene (here M1) to the Ciona gene. Other vertebrate genes are added to the cluster if they are more similar to M1 than M1 is to the Ciona gene. > S1 S1 < S1 Fugu-specific duplication Duplicates may have arisen by speciation (lineage-splitting) or by gene duplication events specific to one or more vertebrates Human-specific duplication
Number of gene duplications 46.6% of ancestral chordate genes are duplicated in 1 lineage 34.5% with at least one duplication before Fugu-tetrapod split 23.5% with at least one duplication after Fugu-tetrapod split No evidence of 2R hypothesis from gene family membership alone, nor from phylogenetics (since duplications abundant on every branch)
Paralogs generated by a gene duplication before the Fugu-tetrapod split count as matches. N-fold redundancy calculated by identifying all cases where 2 different duplicates are within a 100-gene window, and then counting the number of times that their paralogs occur within a 100-gene window elsewhere in the genome. 2R
4-fold redundancy most common - accounts for 25% of the genome
1,912 genes duplicated prior to fish-tetrapod split 2,953 paralogous gene pairs 32.4% are found in 386 detectable paralogous segments, comprising 772 individual genomic segments 454 are tetra-paralogons, where overlapping sets fall into 4-fold groups Of the genes that duplicated after the fish-tetrapod split, only 11% are found in paralogous regions (i.e., duplications after the split did not include large segments of the genome)
Could be of recent origin, or could have undergone multiple rearrangement events that destroyed the tetra-paralogons signal
Old duplications Recent duplications Hox-bearing chromosomes: 50% of genes duplicated after the fish-tetrapod split are tandem duplicates (separated by < 10 genes), whereas only 6% of genes duplicated before the split are tandem duplicates.
Conclusions 2R hypothesis most likely scenario Two rounds of closely spaced autotetraploidization events Some paralogous pairs within tetraparalogons extend over longer regions than others So octaploidy unlikely, since pairs of regions would have had to lose the same sets of genes Phylogenetic trees are not consistently nested So allotetraploidy or two distantly spaced autotetraploidy events unlikely Tree topologies within paralogous blocks not always congruent Gene loss and rediploidization processes probably spanned the two duplication events
Identification of functional elements Coding sequences Relatively easy to identify Many gene prediction programs available General gene structure known, e.g., TATA boxes Splice donor-acceptor sites ESTs and cdnas available to aid gene prediction Non-coding functional sequences Much harder to identify No common structure of regulatory regions TF binding sites are short and ubiquitous Comparative genomics A genomic sequence that provides a function that is under selection and tends to be conserved between species is called a functional sequence (e.g., a protein-coding region or a transcription factor binding site)
Coding region comparison Discover new genes Annotate gene structure Exon-intron structure Compare gene content Find genes common to sets of organisms Find genes unique to an organism We can study how genomes vary within a species to identify genes central to particular processes Can also compare subspecies e.g., E. coli K12 and O157:H7 (pathogenic) Discover gene function Can find missing genes in metabolic pathways
Nature 420:520, 2002
Predicting structure of new genes Many gene finding programs Look for start sites, termination sites, splice sites Analyze codon usage, exon length Comparative method A Find syntenic regions Predict genes using conventional gene finding techniques in both species Genes predicted in both species are probable
Gene prediction Comparative method B Find syntenic regions Perform pattern filtering Coding exons tend to be well conserved Conservation higher in first and second positions of codons Advantage: can deal with sequencing errors
Identification of paralogues and orthologues in other species Best Reciprocal Hit (BRH) Mouse Human http://www.nbn.ac.za/ Reciprocal Hit by Synteny (RHS) Identification of adjacent orthologues Domain checking - internal quality control
Identification of orthologues BRH RHS Human Others Orphan Mouse Matches to some other chromosome http://www.nbn.ac.za/
Genes in mouse shared with other organisms Nature 420:520, 2002
Non-coding region comparison Regulatory region discovery No accurate methods to identify regulatory sequences based on scanning the DNA sequence alone Putative TF binding sites found everywhere Low specificity Assume that regulatory regions are conserved because they serve a vital function
Rationale DNA sequences encoding and regulating the expression of essential proteins and RNAs will be conserved Consequently, the regulatory profiles of genes involved in similar processes among related species will be conserved Conversely, sequences that encode or control the expression of proteins or RNAs responsible for differences between species will be divergent
What species? How distant should the compared species be? Transgenic mice tend to express the transgene in a similar manner to the source mammalian organism, including those for which the ortholog does not exist in mice! Rapidly evolving regulatory regions (e.g., β globin) allow comparisons between closely related species However, some regulatory regions evolve very slowly (e.g., T-cell receptors), so comparisons with more distant species are required
Finding regulatory regions Find conserved regions between species Can also find conserved regions between similarly expressed genes Search for TF binding sites in these conserved regions Pennacchio and Rubin, Nature Rev Genet 2001
Human vs. mouse Similar gene content and linear organization ~340 syntenic blocks (~150 more than discovered using cytological studies) spanning 90% of the genome Difference in genome size Mouse genome is 14% smaller, probably due to a higher rate of deletion Sequence Conservation ~40% in alignments ~5% under selection ~1.5% protein coding ~3.5% non-coding: untranslated regions, regulatory elements, non-protein-coding genes, and chromosomal structural elements Nature 420:520, 2002
Human vs. mouse genomes Nature 420:520, 2002