How to usefully compare homologous plant genes and chromosomes as DNA sequences


The Plant Journal (2008) 53, doi: /j X x

TECHNIQUES FOR MOLECULAR ANALYSIS

Eric Lyons * and Michael Freeling
Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA 94720, USA

Received 1 June 2007; revised 11 September 2007; accepted 17 September
* For correspondence (fax ; elyons@nature.berkeley.edu).

Summary

There are four sequenced and publicly available plant genomes to date. With many more slated for completion, one challenge will be to use comparative genomic methods to detect novel evolutionary patterns in plant genomes. This research requires sequence alignment algorithms to detect regions of similarity within and among genomes. However, different alignment algorithms are optimized for identifying different types of homologous sequences. This review focuses on plant genome evolution and provides a tutorial for using several sequence alignment algorithms and visualization tools to detect useful patterns of conservation: conserved non-coding sequences, false positive noise, subfunctionalization, synteny, annotation errors, inversions and local duplications. Our tutorial encourages the reader to experiment online with the reviewed tools as a companion to the text.

Keywords: plant comparative genomics, CNS, synteny, fractionation.

Introduction

Comparative genomics is founded on the assumption that much of life's language is contained in its linear DNA sequence. Comparisons of genomic DNA sequences, be they from different species or, as with polyploids, within the same nucleus, present one way to understand the syntax and vocabulary of this language. One great advantage of using whole rather than partial genome sequence is that comparisons may be made between the most closely related genes or regions in the genomes compared.
Homologous genes or chromosomal regions are similar because they share a common ancestor, but the closest homolog can be inferred only by finding homologs within regions containing a similar pattern of gene content. If these best homologous DNA sequences are from different organisms, they are called orthologs (with some exceptions). If homologous sequences are within one genome, they are called paralogs. A special case of paralogy results from polyploidy. Duplicated genes or chromosomal regions derived from polyploidy are called homeologs. Comparisons among homeologs are routine when working with angiosperms. The fundamentals of comparative genomics and its nomenclature have been reviewed elsewhere (Koonin, 2005). Some definitions particularly important for plant scientists are given in Table 1.

Journal compilation © 2008 Blackwell Publishing Ltd

Table 1 Comparative genomic definitions of special relevance to angiosperms

Orthologs: A pair of homologous genes or chromosomal regions derived from the same syntenous chromosomal positions in different species. Additional gene duplications and/or losses following speciation may result in complex relationships between sets of orthologous genes.

Plant CNS: A protocol for identifying conserved non-coding sequences (CNSs) in plants using a pair-wise BLASTN (Altschul et al., 1990) to identify high-scoring segment pairs (HSPs) between the non-protein-coding sequences near usefully diverged, orthologous or homeologous genes. These sequences are at least 15-bp long with an e-value equal to or more significant than a 15/15 exact nucleotide match (Inada et al., 2003; Kaplinsky et al., 2002). Other equally sensitive alignment algorithms can be substituted, as long as the 15/15 exact match significance cut-off is used.

Homeologs: A pair of genes retained following polyploidy, identified by residing in syntenic regions of the chromosome. Being duplicates within the same organism, homeologs are a special case of paralogs, but all homeologs occurred contemporaneously, whereas the history of local gene duplicates is obscured by gene conversion.

Fractionation: The mechanism by which a duplicated gene, chromosomal segment or genome tends to return to pre-duplication gene content, but not necessarily retain its pre-duplication gene order. Fractionation is the loss of one or the other of the initial homeologs, but not of both. The process of fractionation is associated with chromosomal rearrangements and transcriptome shock (Wang et al., 2006), and may help cluster dose-sensitive genes (Thomas et al., 2006).

Plant αCNS: As CNS above, but the chromosomal regions are homeologous (syntenous and paralogous) remnants of the most recent tetraploidy event (α) in the lineage (Thomas et al., 2006). BLAST results for α pairs in Arabidopsis are displayed and may be researched in a custom viewer. Subfunctionalization, defined in the text, is expected of homeologous pairs, but not of orthologous pairs. Furthermore, homeologs are under different selective constraints as compared with orthologs.

Genespace: Genespace is defined here as the space of an individual gene, which is a computational surrogate for cistron, where the total genespace of a genome is the sum of all of the genespaces of its genes. This gene-level genespace is computed after CNSs have been identified for a syntenic genomic region, and each CNS has been sorted to a gene: the segment of genome between the most 5′ (upstream) and most 3′ CNS, untranslated region (UTR) or feature, plus approximately 500 bp on each side (depending on neighboring features; Thomas et al., 2007). Within a genespace are exons, UTRs, CNSs, known motifs, positions where specific transcription factor binding sites reside and any feature that is fixed at a chromosomal locus. This non-standard term has little use in mammals because CNSs are difficult to sort to individual genes, but is particularly useful for plant research.

Phylogenetic footprint: The most inclusive term for the conserved sequence between two or more sequences without stipulations as to the extent of divergence. A CNS is a type of phylogenetic footprint.

Local alignment algorithm: Computational method to identify local regions of sequence similarity between two or more biological sequences, where hits may or may not be collinear, and may be on either strand.

Global alignment algorithm: Computational method to find the best possible alignment between two or more biological sequences that extends across the entire length of all sequences on the same DNA strand. If settings are not stringent enough, noise can look like syntenic conserved regions because global algorithms make all alignments collinear.

Comparison of biological parts to identify similarities is an ancient preoccupation originally used to order the natural world. The Linnaean classification system makes a fine example. Later, these comparisons were used to identify possible evolutionary trends. One such eukaryotic trend is towards increasing maximums of morphological complexity (Freeling and Thomas, 2006), but there are many more. The modern synthesis of Darwin's natural selection, genetic laws and some principles of population genetics (see Mayr, 1993) provides the most popular logic by which genomes are compared.

It is known that genomic DNA sequences have regions with and without function, and different functional regions confer their function through different means. DNA may function by indirectly encoding protein, directly encoding RNA, binding macromolecules, directing or modifying the movement of regulatory molecules or being epigenetically modified. Compounding the matter, some DNA may have multiple simultaneous functions. Reducing a primary DNA sequence to a particular set of biologically meaningful structures is daunting (Pearson, 2006). However, it is sometimes possible to find something meaningful about the biological function of DNA with incomplete structural knowledge by comparison with a related DNA sequence. It is at this juncture that comparative genomics can be useful because DNA that functions, without regard to mechanism, tends to have its primary sequence evolutionarily conserved (Hardison, 2000, 2003).

Our purpose is limited. We have prepared a tutorial of DNA sequence comparison algorithms and data visualization tools commonly used by plant researchers. Using these tools, we identify the types of information that can be acquired, show how the ability to change alignment algorithms and parameters is crucial for discovery and illustrate how visualization of the results is almost as important as the resolution of the alignment algorithm itself.

Plant (angiosperm) genomes are known to be different from mammalian or any other animal genomes in several important ways. Recent and ancient polyploidy is widespread among angiosperms. The former may be deduced from chromosome counts (Adams and Wendel, 2005), whereas detecting the latter requires a nearly complete genome sequence. Within the fully-sequenced genomes of Arabidopsis thaliana, poplar, rice and grape are the remnants of at least two ancient tetraploidies (Adams and Wendel, 2005; Bowers et al., 2003; De Bodt et al., 2005; Jaillon et al., 2007; Paterson et al., 2005; Tuskan et al., 2006). Ancient tetraploidies cannot be inferred from chromosome counts because fractionation, the mechanism of genomic content loss that naturally follows all types of DNA duplications, often returns a polyploid to a chromosomal number and gene count more like that of its pre-polyploid ancestor.
In addition to polyploidy, plant genomes contain much transposon-derived DNA. Such DNA is usually only a few million years old, and is often found both in locally repeated blocks and spread throughout the genome. Researchers must be aware that sequences of this highly repetitive nature often obfuscate comparisons among and within plant genomes.

Finally, the region around genes that contains additional non-protein coding functional sequence is structured differently in angiosperms as compared with mammals. These sequences are identified by comparison of duplicated chromosomal regions, and are often called conserved non-coding sequences (CNSs; Table 1). Mammalian CNSs are approximately 10 times larger and are much more numerous than plant CNSs when using alignment cut-offs appropriate for plant CNS discovery (Kaplinsky et al., 2002). Were the most popular CNS alignment cut-off used in animal research (100-bp long with >70% identity; Loots et al., 2000) applied to plants, plants would have nearly zero CNSs (Gao and Innan, 2004; Inada et al., 2003; Thomas et al., 2007). It follows that CNSs are more deeply conserved in the vertebrate lineage than in the angiosperm lineage. Vertebrates have over a thousand enhancer-like conserved non-coding sequences that have been conserved since the divergence of human and fish 450 Mya (Goode et al., 2005; Ovcharenko et al., 2005b; Siepel et al., 2005; Woolfe et al., 2005). Although those most-conserved plant CNSs may also operate as enhancers, plants do not have such deeply conserved CNSs (Freeling et al., 2007). As originally observed by Kaplinsky et al. (2002), mammalian CNSs often occur continuously down a chromosome, so the assignment of any one of them to a particular gene is not possible using spacing alone.
Work on maize–rice (Guo and Moose, 2003; Inada et al., 2003), Brachypodium–rice (Bossolini et al., 2007) and especially alignments of the two most recent post-tetraploid genomes within Arabidopsis (Freeling, 2007; Thomas et al., 2007) all demonstrate that almost all plant CNSs cluster near one gene: this cluster of conservation has been used to estimate what we call a single genespace. This non-standard term is defined in Table 1. The reason why plants, as compared with mammals, have less conserved sequence between genes is not known, but this articulated pattern of conservation permits assigning CNSs to genes, and this information is powerful. For example, the most CNS-rich genes in Arabidopsis are transcription factors known to be necessary for response to environmental signals (Freeling et al., 2007).

The concept of synteny is essential to any comparison of homologous genes or chromosomes. Given the inherent complexity of this term, the definition that follows is simply how we use this term. Two or more once-duplicated sequences are said to be syntenic when it is possible, using extant genomic data, to reconstruct a valid ancestral sequence from which the sequences originated. When two chromosomal regions have mainly collinear genes or other features, they are obviously derived from a common ancestral genomic region and are considered syntenic. In reality, duplication is followed by an evolutionary winnowing process (called fractionation, see Table 1) that includes gene loss, inversions, translocations, insertions, deletions and epigenetic marks. This results in a loss of collinearity of genes and other features, but it is often possible to reconstruct a putative ancestor nevertheless. An outgroup genome is often necessary to prove synteny, especially if the remaining duplicate regions share zero or few conserved sequences. When this reconstruction is possible, the duplicate regions are called syntenous or syntenic.
Only the post-duplication movement of single genomic features to another genomic region destroys our ability to detect synteny. Because of ancient polyploidies in all angiosperms, synteny is evidenced within plant genomes, and not just between them. The tutorial that follows uses synteny between homeologous (i.e. syntenic, paralogous; Table 1) genomic features in several ways to identify patterns of evolution in plant genomes. Duplication may be of varying degrees of completeness: local (tandem), segmental, whole chromosome and whole genome (polyploidy). Each sort of duplication has very different selective constraints (Koonin, 2005) and dosage effect/compensation expectations (Birchler et al., 2005; see Freeling and Thomas, 2006). For example, prevalent gene conversion makes comparisons among locally duplicated genes challenging because it unlinks their date of origination from their observed degree of divergence (exemplified in yeast; Gao and Innan, 2004). Some homologous DNA sequence comparisons are meaningful only if the DNA sequences have diverged to a useful level. In theory, DNA sequence without specific function will either accumulate point mutations at the background rate of the region or may be deleted altogether (if such a mechanism operates). The former is true for many third codon position base substitutions in the protein coding sequence, and is true for all of the non-functional sequence. As the non-functional sequence changes more quickly than the functional sequence, there is a point in evolutionary time that conservation of the sequence is evidence of function. Conversely, if the level of sequence divergence is small, conservation is expected because of carry-over. Although adequate divergence is essential, it is important to realize that there can be too much. 
For example, homologous regulatory sites are known to lose sequence similarity even though binding function is conserved (called 'binding site turnover'; Ludwig et al., 2005; Moses et al., 2006). There is certainly a window of useful divergence when comparing plant non-coding sequences. The first three papers on plant CNS discovery (Guo and Moose, 2003; Inada et al., 2003; Kaplinsky et al., 2002) established the maize–rice divergence time as being appropriate for CNS detection, and argued that maize–rice diverged to approximately the same extent as mouse–man. Sufficient divergence for detecting CNSs in plants is indicated when the average BLAST high-scoring segment pair (HSP) between orthologous/homeologous coding regions is approximately 85% identical in nucleotide sequence (unpublished rule-of-thumb, M. Freeling). However, the definitive 'find plant CNS' settings may always be adjusted so that only significant non-coding alignments are detected.

Subfunctionalization is the natural process whereby duplicate cis-acting units of function (e.g. exons and enhancers) tend to lose dispensable sequences in a compensatory fashion (Force et al., 1999; Lynch and Force, 2000). This results in the full set of functions of the ancestral gene being divided between both duplicates so no one gene is complete. Subfunctionalization of cis-acting regulatory DNA sequences has been noticed in plants (Haberer et al., 2004; Langham et al., 2004). When paralogs are aligned, subfunctionalized regions of the sequence cannot be seen because they exist in only one of the two duplicates. An appropriate outgroup capable of better representing the ancestor of the duplicates is required to identify subfunctionalized sequences.

Table 2 Alignment and visualization tools used in the tutorial

Alignment algorithm; Algorithm type; Visualization
Avid (Bray et al., 2003); Global; VISTA (Mayor et al., 2000)
BLASTN (Altschul et al., 1990; Tatusova and Madden, 1999); Local; GELO *
BLASTZ (Schwartz et al., 2000, 2003); Local; GELO *
DiAlign (Morgenstern, 1999; Morgenstern et al., 1998; Pohler et al., 2005); Global; ABC (Couper et al., 2004)
Lagan/Shuffle-Lagan (Brudno et al., 2003a,b); Global; VISTA
Mulan (Ovcharenko et al., 2005a); Local; MULAN

* GELO is our own visualization package (to be published elsewhere).

Comparing genomic sequence

The workhorses of comparative genomics are sequence alignment algorithms. Sequence alignment algorithms break into two major classes: global and local.
Global alignments (Needleman and Wunsch, 1970) generate the best alignment across the whole length of the sequences, whereas local alignments (Smith and Waterman, 1981) find as many best-subsequence alignments as possible. The usefulness of results generated by these two classes of alignment algorithms depends on the type of genomic region analyzed, and how much false-positive noise is retained in the results. In general, if the compared regions are believed to be similar across their entire lengths, then global alignment algorithms are preferred. Cases of inversion and local duplication, both common in syntenic regions, violate this assumption of collinearity, and for these local alignment algorithms are generally preferred. Also, different algorithms in each class are optimized for different alignment tasks. To highlight these differences in algorithm classes and optimizations, we use six alignment algorithms (three global and three local) in the tutorial. These are listed in Table 2.

Visualization software

We are beginning to see the development of modular alignment visualization software that can be used with the output from any sequence alignment algorithm. VISTA (Mayor et al., 2000) is a prime example of this paradigm and has been used for visualizing alignment results from several algorithms. Similarly, we have developed our own genome visualization module, GELO, which we use in the tutorial for visualizing BLAST results, and which is now being used to display the results from several alignment algorithms.

Table 3 Links for regenerating and modifying the BLASTN and BLASTZ examples used in the tutorial, and for obtaining sequence and annotation files in FASTA and GAF (Gene Annotation Format) format, respectively

Figure 1. CNS detection
Figure 2. HSP spike filter
Figure 3. Subfunctionalization
Figure 4. Synteny
Figure 5. Annotation error
Figure 6. Inversion
Figure 7. Local duplication

These output files may be exported to other sequence analysis applications.
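The global versus local distinction drawn under "Comparing genomic sequence" can be made concrete with a minimal dynamic-programming sketch. It is not the code of any tool in Table 2, all of which rely on heuristics for speed. The function name and the scoring values are assumptions chosen for illustration: match +1 and mismatch -2 echo the -r 1 -q -2 BLASTN settings quoted in the figure legends, while the linear gap cost of -2 is a simplification.

```python
def alignment_score(a, b, match=1, mismatch=-2, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False: Needleman-Wunsch style, both sequences aligned end to end.
    local=True:  Smith-Waterman style, best-scoring pair of subsequences.
    Scoring values are illustrative assumptions (linear gap cost).
    """
    cols = len(b) + 1
    # First row: a global alignment pays for leading gaps, a local one does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * len(b)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)   # a local alignment may restart anywhere
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


# A short conserved motif embedded in unrelated flanking sequence (made-up data).
region = "GGGGACGTACGTGGGG"
motif = "ACGTACGT"
print(alignment_score(region, motif, local=True))   # local score
print(alignment_score(region, motif, local=False))  # global score
```

With these toy inputs the local score rewards the embedded motif alone, whereas the global score is dragged down by the gaps needed to span the flanks, which is the behaviour that makes local methods attractive for short, dispersed conserved sequences.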

Figure 1. Detecting conserved non-coding sequences (CNSs) in plants with various sequence comparison algorithms, settings and visualization software. Each analyzes the genespace from an Arabidopsis pair of transcription factor genes (TAIR version 7 At2g18550 and At4g3740) derived from its most recent polyploidy. (a) Shows alignment to both genomic regions; (b–g) show alignment to the genespace of At4g3740 only.
(a) BLASTN (local alignment) using CNS discovery settings for plants (-W 7 -G 5 -E 2 -q -2 -r 1) and a 15-bp spike sequence; GELO visualization. CNSs identified by Thomas et al. (2007) are highlighted by blue double arrows.
(b) BLASTZ (local alignment; default settings) and GELO visualization.
(c) MULAN (local alignment) using mammalian CNS discovery settings of 100 bp, 70% sequence identity.
(d) MULAN (local alignment) using plant CNS discovery settings of 15 bp, 70% sequence identity.
(e) Chaos-DiAlign (global alignment; default) and ABC visualization.
(f) Avid (global alignment; default settings) and VISTA visualization.
(g) Lagan (global alignment; default settings) and VISTA visualization.
Although alignment figures were aligned with respect to one another for easy cross-comparison of identified regions of similarity, the output from DiAlign introduced gaps in the genomic region from chromosome 4 that extended the length of this region.

Figure 2. Rising above the noise. Conserved non-coding sequence (CNS) discovery in Arabidopsis between the homeologous pair of genes At1g01030 and At4g01500, including 5000 nucleotides upstream and downstream of each gene. BLASTN was used to find regions of sequence similarity (-W 7 -G 5 -E 2 -q -2 -r 1). These comparisons use a high-scoring segment pair (HSP) filter devised by Thomas et al. (2007), based on e-value cut-off values calculated by spiking the sequences with a known exact match sequence of variable length, and removing any HSP with an e-value greater than the HSP containing the spike sequence. Empirically evaluating the results shows that the 15-bp spike sequence is appropriate for removing noise from the analysis.
(a) e-value cut-off based on a 12-bp spike sequence.
(b) e-value cut-off based on a 13-bp spike sequence.
(c) e-value cut-off based on a 14-bp spike sequence.
(d) e-value cut-off based on a 15-bp spike sequence.

Figure 3. Subfunctionalization of conserved non-coding sequences (CNSs) illustrated via GenBank accession numbers. BLASTN comparison of three homologous gene regions using plant CNS settings and a 15-bp spike. Two maize homeologs are compared with a rice outgroup ortholog (GenBank accessions AY180106, AY and AP003287, respectively). GEvo permits using the reverse complement of any sequence (along with its annotations), and permits selecting a reference sequence (in this case, rice). BLAST high-scoring segment pairs (HSPs) are blue and green numbered boxes. CNSs that have subfunctionalized are indicated within purple ovals. Subfunctionalization within these genespaces has been discovered previously (Langham et al., 2004). Note the holes in the 5′ region of the upper maize homeolog (AY180106): these are probably recent transposon insertions.

Tutorial

In the tradition of online tutorials for web applications, the following short manual is written colloquially. The reader becomes 'you' at this point in the discourse and we provide you links to our web application for regenerating our examples and figures, as well as generating sequence and annotation sets for import into other comparative genomic tools (Table 3). There are many tools available and these have been reviewed elsewhere (recently by Pollard et al., 2006). To illustrate differences and similarities between different commonly used alignment algorithms, we chose six of them for our examples (Table 2). Please note that there are several ways to import sequences (and associated annotations) into our, and most other, web applications: retrieval from a local database, import from GenBank via an accession number, or directly submitting a sequence in FASTA or GenBank format.
Detecting CNSs

Figure 1 shows the results from the algorithms in Table 2 applied to a pair of genespaces in Arabidopsis: these genespaces from chromosomes 2 and 4, which contain several kb of sequence, were chosen because they are homeologous and, using BLASTN terminology (Figure 1a), they contain 14 short HSPs and three coding sequence (CDS; reading frame) HSPs. Almost all HSPs are collinear. These are typical plant CNSs (except that HSP2 is actually a microRNA gene), and any detailed analysis of these genes should detect these dispersed, short stretches of non-coding DNA sequence conservation. This BLASTN analysis (Figure 1a), using the settings and noise filter defined in Table 1, is our baseline for comparison with the other five alignment algorithms. For each of the other algorithms, only the alignment to the genespace on chromosome 4 is shown.

BLASTZ (Figure 1b) identified the pair of homeologous coding sequences with a single HSP that covered the entire gene model and extended into its 5′ non-coding sequence. Although this 5′ extension covered one CNS identified by BLASTN (HSP8, Figure 1a), it failed to find any distal CNSs. Although we can conclude that BLASTZ can easily identify putative gene homologs, it is not appropriate for finding plant CNSs.

MULAN, another local alignment tool, provides a VISTA-like visualization of results for identifying CNSs that can filter the alignments based on the minimum length of CNSs and their percentage sequence similarity. Applying animal CNS settings (100-bp length, 70% identity), MULAN identified the 5′-most cluster of CNSs (Figure 1c), but did not find the intervening CNSs identified by BLASTN. We changed the MULAN filter to be similar to plant CNS settings by lowering its minimum length to 15 bp (Figure 1d). Although the 5′ CNS cluster covered more sequence, MULAN still missed the same CNSs as with the animal settings.
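The effect of the two length thresholds discussed above can be illustrated with a small sketch of a length-and-identity filter applied to alignment hits. This is not MULAN's code or interface: the `Hit` record, the `filter_hits` function and the four example hits are all hypothetical, invented only to show how a 100-bp cut-off discards short hits that a 15-bp cut-off retains.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A hypothetical alignment hit on the reference sequence."""
    start: int        # start position on the reference (bp)
    length: int       # aligned length (bp)
    identity: float   # percent identical positions, 0-100


def filter_hits(hits, min_length, min_identity=70.0):
    """Keep hits that are long enough and similar enough."""
    return [h for h in hits
            if h.length >= min_length and h.identity >= min_identity]


# Made-up hits: three short plant-like CNS candidates and one long block.
hits = [
    Hit(start=120, length=18, identity=94.4),
    Hit(start=560, length=31, identity=87.1),
    Hit(start=1410, length=112, identity=78.6),
    Hit(start=2050, length=16, identity=62.5),
]

animal_settings = filter_hits(hits, min_length=100)  # 100 bp, 70% identity
plant_settings = filter_hits(hits, min_length=15)    # 15 bp, 70% identity
print(len(animal_settings), len(plant_settings))
```

A filter of this kind can only keep or discard hits that the underlying aligner has already reported, which is consistent with the observation above that lowering the minimum length did not recover the CNSs that MULAN missed.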

Figure 1(e–g) show the results of the global alignment algorithms DiAlign (which uses the local alignment algorithm Chaos for anchors; Brudno and Morgenstern, 2002), Avid and Lagan. All these algorithms identified the pair of homeologous genes and the 5′ distal CNS cluster. Chaos-DiAlign and Lagan found several of the intervening CNSs identified by BLASTN (although the Chaos-DiAlign server did not support adding gene annotations to the ABC visualization). Although BLASTN may not be the most appropriate alignment algorithm for all comparative genomics problems, this comparison shows that it performs well for detecting plant CNSs.

Figure 4. Detecting synteny and fractionation. (a) BLASTN and (b) BLASTZ sequence comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis. Upper region, chromosome 1 identified by gene At1g07300; lower region, chromosome 2 identified by gene At2g These regions comprise six pairs of homeologous genes (blue double arrows). The upper and lower regions have five and six genes, respectively, that do not have homeologs (purple ovals), and the lower region has one annotated pseudogene (orange oval). Blue numbered boxes mark regions of sequence similarity identified by BLAST. (c) Comparison of the two intragenomic syntenic regions from (a) with a syntenic outgroup sequence from Vitis vinifera anchored by gene GSVIV Green and red numbered boxes are BLASTZ high-scoring segment pairs (HSPs) between the intragenomic regions of the in-group and the outgroup sequence. Red and green ovals and arrows identify genes and their orthologous regions in the outgroup sequence. Purple ovals identify genes not present in the outgroup sequence; the orange oval is a pseudogene. By comparison with an outgroup sequence, fractionation of gene content becomes apparent. All homeologous gene pairs and the majority of singlet genes are represented in the outgroup sequence. Notice that many annotated grape genes are not represented in the two Arabidopsis chromosomes shown. This is expected because two tetraploidies occurred along the Arabidopsis lineage, whereas none happened along the grape lineage; there is another equally syntenic pair of Arabidopsis chromosomal regions that are the fractionation products of this segment (Jaillon et al., 2007).

Figure 5. Detecting annotation errors. (a) Alignment of a homeologous gene pair in Arabidopsis with an annotation error using BLASTN. The regions analyzed included 2500 nucleotides of the 5′ and 3′ regions of genes At1g07300 and At2g (b) Alignment of two syntenic intragenomic regions from (a) to the syntenic region of an outgroup (Vitis vinifera) using BLASTN. Here it is evident that the Arabidopsis gene model for At2g29640 probably encompasses two genes, the 5′ section of which has been lost from the other Arabidopsis syntenic region.

Where is the noise?

BLASTN reduces false-positive noise in its alignments (i.e. HSPs, 'blast hits') by using the concept of an expect value (e-value). An in-depth discussion of the e-value calculation in BLAST is beyond the scope of this review, so please see the BLAST documentation for full details. To facilitate CNS research, Thomas et al. (2007; and the Arabidopsis CNS website, cnr.berkeley.edu/atcns) devised a heuristic method to efficiently filter noise from genespaces of various lengths. These workers (from this laboratory) added an identical sequence of known length (called a spike sequence) to the 3′ end of the compared sequences. Using BLASTN to generate HSPs, they identified the HSP containing the spike sequence and removed all other HSPs of greater e-value. They found that a 15-bp spike sequence eliminated most of the noise from their analyses. Figure 2 shows a short syntenic region within Arabidopsis subjected to various levels of noise filtration using spike sequences of various lengths. Note that the filter with a 15-bp spike sequence eliminates all noise from the analysis (Figure 2c,d), leaving four CNSs. We leave it to you to try various other spike sequence lengths and gap/mismatch penalties using the links in Table 3.
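The spike-sequence filter described above can be sketched in a few lines. This is an illustration of the idea as stated in the text, not the code of Thomas et al. (2007): the `HSP` record, the function names, the coordinate convention (0-based, end-exclusive), the example spike and the example e-values are all assumptions, and the sketch also assumes that the artificial spike HSP itself is dropped from the final result.

```python
from dataclasses import dataclass


@dataclass
class HSP:
    """A hypothetical high-scoring segment pair on the query sequence."""
    query_start: int   # 0-based start
    query_end: int     # end-exclusive
    evalue: float


def add_spike(seq, spike):
    """Append the spike to the 3' end; return the sequence and the spike span."""
    return seq + spike, (len(seq), len(seq) + len(spike))


def spike_filter(hsps, spike_start, spike_end):
    """Remove HSPs with an e-value greater than that of the spike HSP."""
    spike_hits = [h for h in hsps
                  if h.query_start <= spike_start and h.query_end >= spike_end]
    if not spike_hits:
        raise ValueError("no HSP contains the spike sequence")
    # Assumption: if several HSPs span the spike, use the most significant one.
    cutoff = min(h.evalue for h in spike_hits)
    return [h for h in hsps if h.evalue <= cutoff and h not in spike_hits]


# Made-up example: a 40-bp sequence spiked with an arbitrary 15-bp exact match.
spike = "GATTACAGATTACAG"
spiked, span = add_spike("ACGT" * 10, spike)

hsps = [
    HSP(3, 25, 1e-6),    # more significant than the spike: kept
    HSP(30, 38, 0.5),    # less significant than the spike: removed
    HSP(40, 55, 1e-3),   # the spike HSP itself, which sets the cut-off
    HSP(10, 19, 2.0),    # less significant than the spike: removed
]
kept = spike_filter(hsps, *span)
print(kept)
```

Because the cut-off is read from the spike HSP of each comparison rather than fixed in advance, the same procedure adapts to genespaces of different lengths, which is the property the heuristic was devised for.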
Subfunctionalization of CNSs

Now that we can detect CNSs in homeologous genespaces, we will extend the analysis to include an outgroup sequence for the purpose of identifying CNSs that are shared or fractionated/subfunctionalized. Figure 3 shows BLASTN comparisons, using a 15-bp spike sequence, of two homeologous maize genes (liguleless-like transcription factors retained from a tetraploidy that happened approximately 15 Ma) to a rice ortholog. In this example, each maize gene has several CNSs in common with rice that are not shared with the homeologous genespace (CNSs highlighted with purple ovals), which is evidence for subfunctionalization of the non-coding sequence of the maize genes.

Synteny demonstration

Figure 4(a,b) visualizes synteny between two intragenomic regions of Arabidopsis using BLASTN and BLASTZ, respectively. In both analyses, there are six pairs of genes that share a high degree of sequence similarity (blue double arrows) and are collinear, demonstrating synteny. However, the results from BLASTZ are easier to interpret visually. There are several genes in each region that do not have a corresponding homeolog (purple ovals), which we assume is the result of fractionation. To demonstrate fractionation, comparison with an outgroup sequence is necessary. For understanding intragenomic fractionation, such an outgroup would ideally have diverged before the intragenomic duplication event and not undergone a duplication event of its own. Figure 4(c) shows an example of this using Vitis vinifera (grape; Jaillon et al., 2007). In this example, although the intragenomic regions share a subset of their gene content, the unannotated outgroup contains the majority of the gene content and provides evidence of fractionation. In addition, the outgroup comparison allows us to infer the pre-duplication ancestral state of the intragenomic syntenic regions, and to track which genes have been preserved as singlets or retained as duplicates.

Figure 6. Detecting inversions. Sequence comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis with an inversion in one region. The upper region is from chromosome 1 identified by gene At1g02690, and the lower region is from chromosome 4 identified by gene At4g02150. (a) BLASTZ. (b) Shuffle-Lagan. Blue arrows highlight homeologs, red arrows highlight homeologs in an inverted chromosomal region and the orange arrow represents a putative non-annotated gene on chromosome 1. Both algorithms identified the inversion event.

Expect errors in genomic sequences and annotation

If you examine the homeologous gene pair At1g07300 and At2g29640 from the previous example (Figure 4b, yellow exons in gene models), you will notice that these sequences have been assigned very different exon structures with respect to one another.
Looking at their shared sequence similarity, you will notice that the BLASTZ HSP covers and extends beyond the 5′ end of At1g07300, and partially covers the gene model of At2g29640. This difference may indicate that the genes are evolving in a unique fashion or that an annotation error was made. In either case, this pair of homeologs needs closer inspection. Although Arabidopsis gene models are certainly the best current models in plants, many Arabidopsis gene models are incorrect (Thomas et al., 2007). Figure 5(a) shows a pair-wise analysis of the two Arabidopsis regions using BLASTN. There are two clusters of HSPs: one set (HSPs 1 and 2) covers the entire coding region of At1g07300 and the 3′ exon of At2g29640, and the other set (HSPs 3, 4, 5, 6 and 7) covers the 5′ region of At1g07300 and the intronic region of At2g29640, with one HSP overlapping a middle coding exon. The general lack of congruence of the gene models for these homeologs and the odd placement of their sequence similarity suggest an annotation error. Checking their annotations at TAIR shows full-length cDNA support for At1g07300 and none for At2g29640. This implies that the gene model for At1g07300 is correct and that At2g29640 has an annotation error.

Figure 7. Detecting local duplications. (a) BLASTZ comparison of a genomic region with itself (chromosome 1) identifies a local duplicate (red double arrows) consisting of genes At1g07440 and At1g (b) BLASTZ comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis. The upper region is from chromosome 1 and the lower region is from chromosome 2 identified by genes At1g07440 and At2g29300, respectively. Blue numbered boxes mark regions of sequence similarity identified by BLASTZ. Blue arrows highlight syntenic paralogous gene pairs. Regions highlighted in red denote the expansion or contraction of a gene family.

For further analysis of annotation errors uncovered through comparison of syntenic regions, an outgroup sequence is needed. Figure 5(b) shows the two intragenomic regions with an annotation error compared with an outgroup sequence (grape, V. vinifera). Here, we can see that both At1g07300 and At2g29640 have some 3′ sequence similarity to the outgroup. However, At2g29640 also has 5′ sequence similarity to the outgroup that is not present in the 5′ region of At1g07300. Also, the 5′ cluster of Arabidopsis HSPs in the non-coding sequence is not present in the outgroup. This suggests that the HSPs in the 5′ cluster are CNSs and that At2g29640 may represent two genes, one of which has been retained in the syntenic Arabidopsis genomic region, and one that has been fractionated. In addition, you will notice an HSP stack (HSPs 5–8) in the comparison between At1g07300 and V. vinifera. This results from a simple sequence repeat (in this case a GAGA repeat) and the way in which BLAST identifies regions of sequence similarity.
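HSP stacks of the kind noted above can also be flagged programmatically before inspecting a graphic. The following Python sketch is a hypothetical helper, not part of any tool reviewed here: `find_hsp_stacks` and its `min_size` parameter are illustrative names, and the input is assumed to be the coordinates of each HSP on one of the two compared sequences, given as 1-based inclusive (start, end) pairs with start <= end. It groups HSPs whose intervals chain-overlap on that sequence and reports groups of at least `min_size` members.

```python
from typing import List, Tuple

Interval = Tuple[int, int]          # (start, end) of an HSP on one sequence
Stack = Tuple[int, int, int]        # (region start, region end, number of HSPs)


def find_hsp_stacks(intervals: List[Interval], min_size: int = 3) -> List[Stack]:
    """Report regions of one sequence where at least `min_size` HSPs pile up."""
    stacks: List[Stack] = []
    current: List[Interval] = []    # HSPs in the cluster being built
    current_end = 0                 # right-most base covered by that cluster

    for start, end in sorted(intervals):
        if current and start <= current_end:
            # This HSP overlaps the running cluster, so it joins the pile.
            current.append((start, end))
            current_end = max(current_end, end)
        else:
            # Gap reached: close the previous cluster and start a new one.
            if len(current) >= min_size:
                stacks.append((current[0][0], current_end, len(current)))
            current = [(start, end)]
            current_end = end

    if len(current) >= min_size:
        stacks.append((current[0][0], current_end, len(current)))
    return stacks
```

A reported stack is only a prompt for closer inspection: piles of HSPs can arise from a simple sequence repeat, as in the GAGA example, or from one gene matching several local duplicates in the other region.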
Inversions happen frequently, and break collinearity

Figure 6 compares two alignment tools, BLASTZ and Shuffle-Lagan, for their ability to identify an inversion within a syntenic region. Although both algorithms were able to identify an inversion containing at least four genes as well as a putatively missed gene in one region, identifying regions of similarity is easier using the GELO visualization for BLASTZ.

Local duplications are very common and can greatly clutter alignment graphics

Duplications of two general types are shown in Figure 7 using BLASTZ. Figure 7(a) shows a region with a local duplication compared with itself, and Figure 7(b) shows a region with a local duplication compared with its syntenic region containing 12 local duplicates (two of which are pseudogenes). Notice that HSP1 in Figure 7(a) covers nearly the entire sequence. This is the result of comparing a genomic region against itself. Also, notice the HSP stacks in Figure 7(b) that occur when one gene is present in many copies in the other region. Although the HSP numbers overlap and are difficult to interpret, which is a limitation of this type of visualization, the expansion of this gene family by local duplication is apparent.

Conclusion

Looking forward

Plant biologists face an exciting future. There will be about a dozen plant genomes completed over the next few years, bringing opportunities to characterize new patterns of similarity and change in the structure and content of plant genomes. A challenge will be to associate phenotypes with specific patterns of sequence conservation. For example, many sequence motifs identified in CNSs are under active selection, but we know little else about their function, either biochemically or phenotypically. Although we know that there are different selective environments for genomes arising from polyploidy versus speciation, we do not yet understand the evolutionary constraints and consequences. However, we know that after polyploidy, genes are retained or lost based on their family type and their ancestral genomic region. Apart from knowing that gene dosage is primarily important for retention, and that subfunctionalization is particularly important after retention (Freeling, 2007), we do not understand exactly how bias of gene content occurs as a consequence of duplications. Comparative genomics is a young and vibrant field, especially so for plants, because plant genespace is less complex than that of mammals and because tetraploidies offer many advantages for analysis. At the core of this enterprise are DNA alignment and visualization tools, some of which are reviewed here.

References

Adams, K.L. and Wendel, J.F. (2005) Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8.
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215.
Birchler, J.A., Riddle, N.C., Auger, D.L. and Veitia, R.A. (2005) Dosage balance in gene regulation: biological implications. Trends Genet. 21.
Bossolini, E., Wicker, T., Knobel, P.A. and Keller, B. (2007) Comparison of orthologous loci from small grass genomes Brachypodium and rice: implications for wheat genomics and grass genome annotation. Plant J. 49.
Bowers, J.E., Chapman, B.A., Rong, J. and Paterson, A.H. (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature, 422.
Bray, N., Dubchak, I. and Pachter, L. (2003) AVID: a global alignment program. Genome Res. 13.
Brudno, M. and Morgenstern, B. (2002) Fast and sensitive alignment of large genomic sequences. Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB). IEEE Computer Society Press.
Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., NISC Comparative Sequencing Program, Green, E.D., Sidow, A. and Batzoglou, S. (2003a) LAGAN and Multi-Lagan: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13.
Brudno, M., Malde, S., Poliakov, A., Do, C.B., Couronne, O., Dubchak, I. and Batzoglou, S. (2003b) Glocal alignment: finding rearrangements during alignment. Bioinformatics, 19(Suppl. 1), I54–I62.
Cooper, G.M., Singaravelu, S.A.G. and Sidow, A. (2004) ABC: software for interactive browsing of genomic multiple sequence alignment data. BMC Bioinformatics, 5, 192.
De Bodt, S., Maere, S. and Van de Peer, Y. (2005) Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20.
Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.L. and Postlethwait, J. (1999) Preservation of duplicate genes by complementary degenerative mutations. Genetics, 151.
Freeling, M. (2007) The evolutionary position of subfunctionalization, downgraded. Genome Dyn., in press.
Freeling, M., Rapaka, L., Lyons, E., Pedersen, B. and Thomas, B.C. (2007) G-boxes, bigfoot genes, and environmental response: characterization of intragenomic conserved noncoding sequences in Arabidopsis. Plant Cell, 19.
Freeling, M. and Thomas, B.C. (2006) Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res. 16.
Gao, L.Z. and Innan, H. (2004) Very low gene duplication rate in the yeast genome. Science, 306.
Goode, D.K., Snell, P., Smith, S.F., Cooke, J.E. and Elgar, G. (2005) Highly conserved regulatory elements around the SHH gene may contribute to the maintenance of conserved synteny across human chromosome 7q36.3. Genomics, 86.
Guo, H. and Moose, S.P. (2003) Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell, 15.
Haberer, G., Hindemitt, T., Meyers, B.C. and Mayer, K.F. (2004) Transcriptional similarities, dissimilarities and conservation of cis-acting elements in duplicated genes in Arabidopsis. Plant Physiol. 136.
Hardison, R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 16.
Hardison, R.C. (2003) Comparative genomics. PLoS Biol. 1, E58.
Inada, D.C., Bashir, A., Lee, C., Thomas, B.C., Ko, C., Goff, S.A. and Freeling, M. (2003) Conserved noncoding sequences in the grasses. Genome Res. 13.
Jaillon, O., Aury, J., Noel, B. et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature, 449.
Kaplinsky, N.J., Braun, D.M., Penterman, J., Goff, S.A. and Freeling, M. (2002) Utility and distribution of conserved noncoding sequences in the grasses. Proc. Natl Acad. Sci. USA, 99.
Koonin, E.V. (2005) Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39.
Langham, R.J., Walsh, J., Dunn, M., Ko, C., Goff, S.A. and Freeling, M. (2004) Genomic duplication, fractionation and the origin of regulatory novelty. Genetics, 166.
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M. and Frazer, K.A. (2000) Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288.
Ludwig, M.Z., Palsson, A., Alekseeva, E., Bergman, C.M., Nathan, J. and Kreitman, M. (2005) Functional evolution of a cis-regulatory module. PLoS Biol. 3, e93.
Lynch, M. and Force, A. (2000) The probability of duplicate gene preservation by subfunctionalization. Genetics, 154.
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S. and Dubchak, I. (2000) VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16.
Mayr, E. (1993) What was the Evolutionary Synthesis? Trends Ecol. Evol. 8.
Morgenstern, B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15.
Morgenstern, B., French, K., Dress, A. and Werner, T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14.
Moses, A.M., Pollard, D.A., Nix, D.A., Iyer, V.N., Li, X.Y., Biggin, M.D. and Eisen, M.B. (2006) Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput. Biol. 2, e130.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48.
Ovcharenko, I., Loots, G.G., Giardine, B.M., Hou, M., Ma, J., Hardison, R.C., Stubbs, L. and Miller, W. (2005a) Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res. 15.
Ovcharenko, I., Loots, G.G., Nobrega, M.A., Hardison, R.C., Miller, W. and Stubbs, L. (2005b) Evolution and functional classification of vertebrate gene deserts. Genome Res. 15.
Paterson, A.H., Bowers, J.E., Van de Peer, Y. and Vandepoele, K. (2005) Ancient duplication of cereal genomes. New Phytol. 165.
Pearson, H. (2006) Genetics: what is a gene? Nature, 441.
Pohler, D., Werner, N., Steinkamp, R. and Morgenstern, B. (2005) Multiple alignment of genomic sequences using CHAOS, DIALIGN, and ABC. Nucleic Acids Res. 33.
Pollard, D., Bergman, C., Stoye, J., Celniker, S. and Eisen, M. (2006) Benchmarking tools for the alignments of functional noncoding DNA. BMC Bioinformatics, 5, 6.
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R.C. and Miller, W. (2000) PipMaker: a Web server for aligning two genomic DNA sequences. Genome Res. 10.
Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. and Miller, W. (2003) Human-mouse alignments with BLASTZ. Genome Res. 13.
Siepel, A., Bejerano, G., Pedersen, J.S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147.
Tatusova, T.A. and Madden, T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174.
Thomas, B.C., Pedersen, B. and Freeling, M. (2006) Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res. 16.
Thomas, B.C., Rapaka, L., Lyons, E., Pedersen, B. and Freeling, M. (2007) Intragenomic conserved noncoding sequences in Arabidopsis. Proc. Natl Acad. Sci. USA, 104.
Tuskan, G.A., Difazio, S., Jansson, S. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313.
Wang, J., Tian, L., Lee, H.S. et al. (2006) Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics, 172.
Woolfe, A., Goodson, M., Goode, D.K. et al. (2005) Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3, e7.
