How to usefully compare homologous plant genes and chromosomes as DNA sequences


The Plant Journal (2008) 53, doi: /j X x

TECHNIQUES FOR MOLECULAR ANALYSIS

Eric Lyons* and Michael Freeling
Department of Plant and Microbial Biology, University of California, Berkeley, Berkeley, CA 94720, USA

Received 1 June 2007; revised 11 September 2007; accepted 17 September
* For correspondence (fax ; elyons@nature.berkeley.edu).

Summary

There are four sequenced and publicly available plant genomes to date. With many more slated for completion, one challenge will be to use comparative genomic methods to detect novel evolutionary patterns in plant genomes. This research requires sequence alignment algorithms to detect regions of similarity within and among genomes. However, different alignment algorithms are optimized for identifying different types of homologous sequences. This review focuses on plant genome evolution and provides a tutorial for using several sequence alignment algorithms and visualization tools to detect useful patterns of conservation: conserved non-coding sequences, false positive noise, subfunctionalization, synteny, annotation errors, inversions and local duplications. Our tutorial encourages the reader to experiment online with the reviewed tools as a companion to the text.

Keywords: plant comparative genomics, CNS, synteny, fractionation.

Introduction

Comparative genomics is founded on the assumption that much of life's language is contained in its linear DNA sequence. Comparisons of genomic DNA sequences (be they from different species or, as with polyploids, within the same nucleus) present one way to understand the syntax and vocabulary of this language. One great advantage of using whole rather than partial genome sequence is that comparisons may be made between the most closely related genes or regions in the genomes compared.
Homologous genes or chromosomal regions are similar because they share a common ancestor, but the closest homolog can be inferred only by finding homologs within regions containing a similar pattern of gene content. If these best homologous DNA sequences are from different organisms, they are called orthologs (with some exceptions). If homologous sequences are within one genome, they are called paralogs. A special case of paralogy results from polyploidy. Duplicated genes or chromosomal regions derived from polyploidy are called homeologs. Comparisons among homeologs are routine when working with angiosperms. The fundamentals of comparative genomics and its nomenclature have been reviewed elsewhere (Koonin, 2005). Some definitions particularly important for plant scientists are given in Table 1.

Comparison of biological parts to identify similarities is an ancient preoccupation originally used to order the natural world. The Linnaean classification system makes a fine example. Later, these comparisons were used to identify possible evolutionary trends. One such eukaryotic trend is towards increasing maximums of morphological complexity (Freeling and Thomas, 2006), but there are many more. The modern synthesis of Darwin's natural selection, genetic laws and some principles of population genetics (see Mayr, 1993) provides the most popular logic by which genomes are compared.

It is known that genomic DNA sequences have regions with and without function, and different functional regions confer their function through different means. DNA may function by indirectly encoding protein, directly encoding RNA, binding macromolecules, directing or modifying the movement of regulatory molecules or being epigenetically modified. Compounding the matter, some DNA may have multiple simultaneous functions. Reducing a primary DNA sequence to a particular set of biologically meaningful structures is daunting (Pearson, 2006). However, it is sometimes possible to find something meaningful about the biological function of DNA with incomplete structural knowledge by comparison with a related DNA sequence. It is at this juncture that comparative genomics can be useful because DNA that functions, without regard to mechanism, tends to have its primary sequence evolutionarily conserved (Hardison, 2000, 2003).

Journal compilation © 2008 Blackwell Publishing Ltd

Table 1 Comparative genomic definitions of special relevance to angiosperms

Orthologs: A pair of homologous genes or chromosomal regions derived from the same syntenous chromosomal positions in different species. Additional gene duplications and/or losses following speciation may result in complex relationships between sets of orthologous genes.

Plant CNS: A protocol for identifying conserved non-coding sequences (CNSs) in plants using a pair-wise BLASTN (Altschul et al., 1990) to identify high-scoring segment pairs (HSPs) between the non-protein-coding sequences near usefully diverged, orthologous or homeologous genes. These sequences are at least 15-bp long with an e-value equal to or more significant than a 15/15 exact nucleotide match (Inada et al., 2003; Kaplinsky et al., 2002). Other equally sensitive alignment algorithms can be substituted, as long as the 15/15 exact match significance cut-off is used.

Homeologs: A pair of genes retained following polyploidy, identified by residing in syntenic regions of the chromosome. Being duplicates within the same organism, homeologs are a special case of paralogs, but all homeologs occurred contemporaneously, whereas the history of local gene duplicates is obscured by gene conversion.

Fractionation: The mechanism by which a duplicated gene, chromosomal segment or genome tends to return to pre-duplication gene content, but not necessarily retain its pre-duplication gene order. Fractionation is the loss of one or the other of the initial homeologs, but not of both. The process of fractionation is associated with chromosomal rearrangements and transcriptome shock (Wang et al., 2006), and may help cluster dose-sensitive genes (Thomas et al., 2006).

Plant αCNS: As CNS above, but the chromosomal regions are homeologous (syntenous and paralogous) remnants of the most recent tetraploidy event (α) in the lineage (Thomas et al., 2006). BLAST results for α pairs in Arabidopsis are displayed and may be researched in a custom viewer. Subfunctionalization, defined in the text, is expected of homeologous pairs, but not in orthologous pairs. Furthermore, homeologs are under different selective constraints as compared with orthologs.

Genespace: Genespace is defined here as the space of an individual gene, which is a computational surrogate for cistron, where the total genespace of a genome is the sum of all of the genespaces of its genes. This gene-level genespace is computed after CNSs have been identified for a syntenic genomic region, and each CNS has been sorted to a gene: the segment of genome between the most 5′ (upstream) and most 3′ CNS, untranslated region (UTR) or feature, plus approximately 500 bp on each side (depending on neighboring features; Thomas et al., 2007). Within a genespace are exons, UTRs, CNSs, known motifs, positions where specific transcription factor binding sites reside and any feature that is fixed at a chromosomal locus. This non-standard term has little use in mammals because CNSs are difficult to sort to individual genes, but is particularly useful for plant research.

Phylogenetic footprint: The most inclusive term for the conserved sequence between two or more sequences without stipulations as to the extent of divergence. A CNS is a type of phylogenetic footprint.

Local alignment algorithm: Computational method to identify local regions of sequence similarity between two or more biological sequences, where hits may or may not be collinear, and may be on either strand.

Global alignment algorithm: Computational method to find the best possible alignment between two or more biological sequences that extends across the entire length of all sequences on the same DNA strand. If settings are not stringent enough, noise can look like syntenic conserved regions because global algorithms make all alignments collinear.

Our purpose is limited. We have prepared a tutorial of DNA sequence comparison algorithms and data visualization tools commonly used by plant researchers. Using these tools, we identify the types of information that can be acquired, show how the ability to change alignment algorithms and parameters is crucial for discovery and illustrate how visualization of the results is almost as important as the resolution of the alignment algorithm itself.

Plant (angiosperm) genomes are known to be different from mammalian or any other animal genomes in several important ways. Recent and ancient polyploidy is widespread among angiosperms. The former may be deduced from chromosome counts (Adams and Wendel, 2005), whereas detecting the latter requires a nearly complete genome sequence. Within the fully-sequenced genomes of Arabidopsis thaliana, poplar, rice and grape are the remnants of at least two ancient tetraploidies (Adams and Wendel, 2005; Bowers et al., 2003; De Bodt et al., 2005; Jaillon et al., 2007; Paterson et al., 2005; Tuskan et al., 2006). Ancient tetraploidies cannot be inferred from chromosome counts because fractionation, the mechanism of genomic content loss that naturally follows all types of DNA duplications, often returns a polyploid to a chromosomal number and gene count more like that of its pre-polyploid ancestor.
In addition to polyploidy, plant genomes contain much transposon-derived DNA. Such DNA is usually only a few million years old, and is often found both in locally repeated blocks and spread throughout the genome. Researchers must be aware that sequences of this highly repetitive nature often obfuscate comparisons among and within plant genomes.

Finally, the region around genes that contains additional non-protein coding functional sequence is structured differently in angiosperms as compared with mammals. These sequences are identified by comparison of duplicated chromosomal regions, and are often called conserved non-coding sequences (CNSs; Table 1). Mammalian CNSs are approximately 10 times larger and are much more numerous than plant CNSs when using alignment cut-offs appropriate for plant CNS discovery (Kaplinsky et al., 2002). Were the most popular CNS alignment cut-off used in animal research (100-bp long with >70% identity; Loots et al., 2000) applied to plants, plants would have nearly zero CNSs (Gao and Innan, 2004; Inada et al., 2003; Thomas et al., 2007). It follows that CNSs are more deeply conserved in the vertebrate lineage than in the angiosperm lineage. Vertebrates have over a thousand enhancer-like conserved non-coding sequences that have been conserved since the divergence of human and fish 450 Mya (Goode et al., 2005; Ovcharenko et al., 2005b; Siepel et al., 2005; Woolfe et al., 2005). Although those most-conserved plant CNSs may also operate as enhancers, plants do not have such deeply conserved CNSs (Freeling et al., 2007). As originally observed by Kaplinsky et al. (2002), mammalian CNSs often occur continuously down a chromosome, so the assignment of any one of them to a particular gene is not possible using spacing alone.
Work on maize-rice (Guo and Moose, 2003; Inada et al., 2003), Brachypodium-rice (Bossolini et al., 2007) and especially alignments of the two most recent post-tetraploid genomes within Arabidopsis (Freeling, 2007; Thomas et al., 2007) all demonstrate that almost all plant CNSs cluster near one gene: this cluster of conservation has been used to estimate what we call a single genespace. This non-standard term is defined in Table 1. The reason why plants, as compared with mammals, have less conserved sequence between genes is not known, but this articulated pattern of conservation permits assigning CNSs to genes, and this information is powerful. For example, the most CNS-rich genes in Arabidopsis are transcription factors known to be necessary for response to environmental signals (Freeling et al., 2007).

The concept of synteny is essential to any comparison of homologous genes or chromosomes. Given the inherent complexity of this term, the definition that follows is simply how we use this term. Two or more once-duplicated sequences are said to be syntenic when it is possible, using extant genomic data, to reconstruct a valid ancestral sequence from which the sequences originated. When two chromosomal regions have mainly collinear genes or other features, they are obviously derived from a common ancestral genomic region and are considered syntenic. In reality, duplication is followed by an evolutionary winnowing process (called fractionation, see Table 1) that includes gene loss, inversions, translocations, insertions, deletions and epigenetic marks. This results in a loss of collinearity of genes and other features, but it is often possible to reconstruct a putative ancestor nevertheless. An outgroup genome is often necessary to prove synteny, especially if the remaining duplicate regions share zero or few conserved sequences. When this reconstruction is possible, the duplicate regions are called syntenous or syntenic.
Only the post-duplication movement of single genomic features to another genomic region destroys our ability to detect synteny. Because of ancient polyploidies in all angiosperms, synteny is evidenced within plant genomes, and not just between them. The tutorial that follows uses synteny between homeologous (i.e. syntenic, paralogous; Table 1) genomic features in several ways to identify patterns of evolution in plant genomes.

Duplication may be of varying degrees of completeness: local (tandem), segmental, whole chromosome and whole genome (polyploidy). Each sort of duplication has very different selective constraints (Koonin, 2005) and dosage effect/compensation expectations (Birchler et al., 2005; see Freeling and Thomas, 2006). For example, prevalent gene conversion makes comparisons among locally duplicated genes challenging because it unlinks their date of origination from their observed degree of divergence (exemplified in yeast; Gao and Innan, 2004).

Some homologous DNA sequence comparisons are meaningful only if the DNA sequences have diverged to a useful level. In theory, DNA sequence without specific function will either accumulate point mutations at the background rate of the region or may be deleted altogether (if such a mechanism operates). The former is true for many third codon position base substitutions in the protein coding sequence, and is true for all of the non-functional sequence. As the non-functional sequence changes more quickly than the functional sequence, there is a point in evolutionary time after which conservation of the sequence is evidence of function. Conversely, if the level of sequence divergence is small, conservation is expected because of carry-over. Although adequate divergence is essential, it is important to realize that there can be too much.
For example, homologous regulatory sites are known to lose sequence similarity even though binding function is conserved (called 'binding site turnover'; Ludwig et al., 2005; Moses et al., 2006). There is certainly a window of useful divergence when comparing plant non-coding sequences. The three first papers on plant CNS discovery (Guo and Moose, 2003; Inada et al., 2003; Kaplinsky et al., 2002) established the maize-rice divergence time as being appropriate for CNS detection, and argued that maize and rice have diverged to approximately the same extent as mouse and man. Sufficient divergence for detecting CNSs in plants is indicated when the average BLAST high-scoring segment pair (HSP) between orthologous/homeologous coding regions is approximately 85% identical in nucleotide sequence (unpublished rule-of-thumb, M. Freeling). However, the definitive 'find plant CNS' settings may always be adjusted so that only significant non-coding alignments are detected.

Subfunctionalization is the natural process whereby duplicate cis-acting units of function (e.g. exons and enhancers) tend to lose dispensable sequences in a compensatory fashion (Force et al., 1999; Lynch and Force, 2000). This results in the full set of functions of the ancestral gene being divided between both duplicates so no one gene is complete. Subfunctionalization of cis-acting regulatory DNA sequences has been noticed in plants (Haberer et al., 2004; Langham et al., 2004). When paralogs are aligned, subfunctionalized regions of the sequence cannot be seen because they exist in only one of the two duplicates. An appropriate outgroup capable of better representing the ancestor of the duplicates is required to identify subfunctionalized sequences.

Table 2 Alignment and visualization tools used in the tutorial

Alignment algorithm | Algorithm type | Visualization | Web service
Avid (Bray et al., 2003) | Global | VISTA (Mayor et al., 2000) |
BLASTN (Altschul et al., 1990; Tatusova and Madden, 1999) | Local | GELO* |
BLASTZ (Schwartz et al., 2000, 2003) | Local | GELO* |
DiAlign (Morgenstern, 1999; Morgenstern et al., 1998; Pohler et al., 2005) | Global | ABC (Couper et al., 2004) |
Lagan/Shuffle-Lagan (Brudno et al., 2003a,b) | Global | VISTA |
Mulan (Ovcharenko et al., 2005a) | Local | MULAN |

* GELO is our own visualization package (to be published elsewhere).

Comparing genomic sequence

The workhorses of comparative genomics are sequence alignment algorithms. Sequence alignment algorithms break into two major classes: global and local.
Global alignments (Needleman and Wunsch, 1970) generate the best alignment across the whole length of the sequences, whereas local alignments (Smith and Waterman, 1981) find as many best-subsequence alignments as possible. The usefulness of results generated by these two classes of alignment algorithms depends on the type of genomic region analyzed, and how much false-positive noise is retained in the results. In general, if the compared regions are believed to be similar across their entire lengths, then global alignment algorithms are preferred. Cases of inversion and local duplication (both common in syntenic regions) violate this assumption of collinearity, and local alignment algorithms are generally preferred. Also, different algorithms in each class are optimized for different alignment tasks. To highlight these differences in algorithm classes and optimizations, we use six alignment algorithms (three global and three local) in the tutorial. These are listed in Table 2.

Visualization software

We are beginning to see the development of modular alignment visualization software that can be used with the output from any sequence alignment algorithm. VISTA (Mayor et al., 2000) is a prime example of this paradigm and has been used for visualizing alignment results from several algorithms. Similarly, we have developed our own genome visualization module, GELO, which we use in the tutorial for visualizing BLAST results, and which is now being used to display the results from several alignment algorithms.

Table 3 Links for regenerating and modifying the BLASTN and BLASTZ examples used in the tutorial, and for obtaining sequence and annotation files in FASTA and GAF (Gene Annotation Format) format, respectively

Figure 1. CNS detection    a: b:
Figure 2. HSP spike filter    a: b: c: d:
Figure 3. Subfunctionalization
Figure 4. Synteny    a: b: c:
Figure 5. Annotation error    a: b:
Figure 6. Inversion    a:
Figure 7. Local duplication    a: b:

These output files may be exported to other sequence analysis applications.
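The difference between the two algorithm classes described under 'Comparing genomic sequence' can be made concrete with a small dynamic-programming sketch. This is not the code of any tool in Table 2: it is a minimal illustration that assumes a match score of 1, a mismatch score of -2 and a simple linear gap penalty of -2 (real tools, including BLASTN, use gap-open and gap-extension penalties and heuristics). The function name and the example sequences are hypothetical.

```python
def align_score(a, b, match=1, mismatch=-2, gap=-2, local=False):
    """Return the best alignment score for sequences a and b.

    local=False: global alignment in the style of Needleman-Wunsch.
    local=True:  local alignment in the style of Smith-Waterman.
    Scores and the linear gap penalty are illustrative assumptions.
    """
    cols = len(b) + 1
    # Global alignment pays for leading gaps; local alignment does not.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0 if local else i * gap] + [0] * (cols - 1)
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(diag, prev[j] + gap, curr[j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so never go below 0.
                score = max(score, 0)
                best = max(best, score)
            curr[j] = score
        prev = curr
    return best if local else prev[-1]


# Two hypothetical sequences sharing an 8-bp block with unrelated flanks.
seq1 = "GGGGACGTACGT"
seq2 = "ACGTACGTCCCC"
global_score = align_score(seq1, seq2)
local_score = align_score(seq1, seq2, local=True)
```

With these inputs the local score reflects only the shared block, whereas the global score is dragged down by the unrelated flanks that it is forced to align end to end.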

[Figure 1 panels: (a) BLASTN (settings for plant CNS discovery), local alignment; (b) BLASTZ (default), local alignment; (c) Mulan (100 bp, 70% sequence identity, animal CNS settings), local alignment; (d) Mulan (15 bp, 70% sequence identity), local alignment; (e) Chaos-DiAlign (default), global alignment; (f) Avid (default), global alignment; (g) Lagan (default), global alignment. Tracks: HSP or sequence similarity, and gene model.]

Figure 1. Detecting conserved non-coding sequences (CNSs) in plants with various sequence comparison algorithms, settings and visualization software. Each analyzes the genespace from an Arabidopsis pair of transcription factor genes (TAIR version 7 At2g18550 and At4g3740) derived from its most recent polyploidy. (a) Shows alignment to both genomic regions; (b-g) shows alignment to the genespace of At4g3740 only.
(a) BLASTN using CNS discovery settings for plants (-W 7 -G 5 -E 2 -q -2 -r 1) and a 15-bp spike sequence; GELO visualization. CNSs identified by Thomas et al. (2007) are highlighted by blue double arrows.
(b) BLASTZ (default settings) and GELO visualization.
(c) MULAN using mammalian CNS discovery settings of 100 bp, 70% sequence identity.
(d) MULAN using plant CNS discovery settings of 15 bp, 70% sequence identity.
(e) Chaos-DiAlign (default) and ABC visualization.
(f) Avid (default settings) and VISTA visualization.
(g) Lagan (default settings) and VISTA visualization.
Although alignment figures were aligned with respect to one another for easy cross-comparison of identified regions of similarity, the output from DiAlign introduced gaps in the genomic region from chromosome 4 that extended the length of this region.

[Figure 2 panels: (a) BLASTN, e-value cut-off based on 12-bp spike sequence; (b) 13-bp spike sequence; (c) 14-bp spike sequence; (d) 15-bp spike sequence.]

Figure 2. Rising above the noise. Conserved non-coding sequence (CNS) discovery in Arabidopsis between the homeologous pair of genes At1g01030 and At4g01500, including 5000 nucleotides upstream and downstream of each gene. BLASTN was used to find regions of sequence similarity (-W 7 -G 5 -E 2 -q -2 -r 1). These comparisons use a high-scoring segment pair (HSP) filter devised by Thomas et al. (2007), based on e-value cut-off values calculated by spiking the sequences with a known exact match sequence of variable length, and removing any HSP with an e-value greater than the HSP containing the spike sequence. Empirically evaluating the results shows that the 15-bp spike sequence is appropriate for removing noise from the analysis.
(a) e-value cut-off based on a 12-bp spike sequence.
(b) e-value cut-off based on a 13-bp spike sequence.
(c) e-value cut-off based on a 14-bp spike sequence.
(d) e-value cut-off based on a 15-bp spike sequence.

Figure 3. Subfunctionalization of conserved non-coding sequences (CNSs) illustrated via GenBank accession numbers. BLASTN comparison of three homologous gene regions using plant CNS settings and a 15-bp spike. Two maize homeologs are compared with a rice outgroup ortholog (GenBank accessions AY180106, AY and AP003287, respectively). GEvo permits using the reverse complement of any sequence (along with its annotations), and permits selecting a reference sequence (in this case, rice). BLAST high-scoring segment pairs (HSPs) are blue and green numbered boxes. CNSs that have subfunctionalized are indicated within purple ovals. Subfunctionalization within these genespaces has been discovered previously (Langham et al., 2004). Note the holes in the 5′ region of the upper maize homeolog (AY180106): these are probably recent transposon insertions.

Tutorial

In the tradition of online tutorials for web applications, the following short manual is written colloquially. The reader becomes 'you' at this point in the discourse and we provide you links to our web application for regenerating our examples and figures, as well as generating sequence and annotation sets for import into other comparative genomic tools (Table 3). There are many tools available and these have been reviewed elsewhere (recently by Pollard et al., 2006). To illustrate differences and similarities between different commonly used alignment algorithms, we chose six of them for our examples (Table 2). Please note that there are several ways to import sequences (and associated annotations) into our, and most other, web applications: retrieval from a local database, import from GenBank via an accession number, or directly submitting a sequence in FASTA or GenBank format.
Detecting CNSs

Figure 1 shows the results from the algorithms in Table 2 applied to a pair of genespaces in Arabidopsis: these genespaces from chromosomes 2 and 4, which contain several kb of sequence, were chosen because they are homeologous and, using BLASTN terminology (Figure 1a), they contain 14 short HSPs and three coding sequence (CDS; reading frame) HSPs. Almost all HSPs are collinear. These are typical plant CNSs (except that HSP2 is actually a microRNA gene), and any detailed analysis of these genes should detect these dispersed, short stretches of non-coding DNA sequence conservation. This BLASTN analysis (Figure 1a), using the settings and noise filter defined in Table 1, is our baseline for comparison with the other five alignment algorithms. For each of the other algorithms, only the alignment to the genespace on chromosome 4 is shown.

BLASTZ (Figure 1b) identified the pair of homeologous coding sequences with a single HSP that covered the entire gene model and extended into its 5′ non-coding sequence. Although this 5′ extension covered one CNS identified by BLASTN (HSP8, Figure 1a), it failed to find any distal CNSs. Although we can conclude that BLASTZ can easily identify putative gene homologs, it is not appropriate for finding plant CNSs.

MULAN, another local alignment tool, provides a VISTA-like visualization of results for identifying CNSs that can filter the alignments based on the minimum length of CNSs and their percentage sequence similarity. Applying animal CNS settings (100-bp length, 70% identity), MULAN identified the 5′-most cluster of CNSs (Figure 1c), but did not find the intervening CNSs identified by BLASTN. We changed the MULAN filter to be similar to plant CNS settings by lowering its minimum length to 15 bp (Figure 1d). Although the 5′ CNS cluster covered more sequence, MULAN still missed the same CNSs as with the animal settings.
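The effect of a length and identity filter of the kind applied in MULAN can be illustrated with a short sketch. It is not MULAN's implementation: the `Hsp` record, the function name and the example values are hypothetical, and only the two cut-offs (100 bp at 70% identity, and 15 bp at 70% identity) come from the text.

```python
from dataclasses import dataclass


@dataclass
class Hsp:
    length: int       # aligned length in bp
    identity: float   # fraction of identical positions, 0.0 to 1.0


def passes_cutoff(hsp, min_length, min_identity):
    """True if an alignment meets both the length and identity thresholds."""
    return hsp.length >= min_length and hsp.identity >= min_identity


ANIMAL_SETTINGS = (100, 0.70)   # 100 bp, 70% identity
PLANT_SETTINGS = (15, 0.70)     # 15 bp, 70% identity

# Hypothetical alignments: short, well-conserved hits plus one long hit.
hsps = [Hsp(18, 1.00), Hsp(32, 0.84), Hsp(120, 0.75), Hsp(12, 1.00)]

animal_hits = [h for h in hsps if passes_cutoff(h, *ANIMAL_SETTINGS)]
plant_hits = [h for h in hsps if passes_cutoff(h, *PLANT_SETTINGS)]
```

In this toy data set the animal settings keep only the single long alignment, whereas the 15-bp minimum also keeps the two short, well-conserved hits. A length and identity filter of this kind is distinct from the e-value criterion in the plant CNS definition of Table 1.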

Figure 4. Detecting synteny and fractionation. (a) BLASTN and (b) BLASTZ sequence comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis. Upper region, chromosome 1 identified by gene At1g07300; lower region, chromosome 2 identified by gene At2g These regions comprise six pairs of homeologous genes (blue double arrows). The upper and lower regions have five and six genes, respectively, that do not have homeologs (purple ovals), and the lower region has one annotated pseudogene (orange oval). Blue numbered boxes mark regions of sequence similarity identified by BLAST. (c) Comparison of the two intragenomic syntenic regions from (a) with a syntenic outgroup sequence from Vitis vinifera anchored by gene GSVIV Green and red numbered boxes are BLASTZ high-scoring segment pairs (HSPs) between the intragenomic regions of the in-group and the outgroup sequence. Red and green ovals and arrows identify genes and their orthologous regions in the outgroup sequence. Purple ovals identify genes not present in the outgroup sequence; the orange oval is a pseudogene. By comparison with an outgroup sequence, fractionation of gene content becomes apparent. All homeologous gene pairs and the majority of singlet genes are represented in the outgroup sequence. Notice that many annotated grape genes are not represented in the two Arabidopsis chromosomes shown. This is expected because two tetraploidies occurred along the Arabidopsis lineage, whereas none happened along the grape lineage; there is another equally syntenic pair of Arabidopsis chromosomal regions that are the fractionation products of this segment (Jaillon et al., 2007).

Figure 1(e-g) shows the results of the global alignment algorithms DiAlign (which uses the local alignment algorithm Chaos for anchors; Brudno and Morgenstern, 2002), Avid and Lagan. All these algorithms identified the pair of homeologous genes and the 5′ distal CNS cluster. Chaos-DiAlign and Lagan found several of the intervening CNSs identified by BLASTN (although the Chaos-DiAlign server did not support adding gene annotations to the ABC visualization). Although BLASTN may not be the most appropriate alignment algorithm for all comparative genomics problems, this comparison shows that it performs well for detecting plant CNSs.
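The comparison above was made by eye from Figure 1. One way such a comparison could be made numerically is to ask which baseline CNS intervals are overlapped by at least one region reported by another algorithm. The sketch below is only an illustration of that idea under stated assumptions: coordinates are half-open (start, end) positions on one sequence, and all names and values are hypothetical rather than taken from the figure.

```python
def overlaps(a, b):
    """True if half-open intervals a and b share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def recovered(baseline, other):
    """Return the baseline intervals overlapped by any interval in other."""
    return [cns for cns in baseline if any(overlaps(cns, hit) for hit in other)]


# Hypothetical coordinates: three baseline CNSs and one region from another tool.
baseline_cns = [(100, 118), (250, 270), (400, 460)]
other_tool = [(390, 470)]

found = recovered(baseline_cns, other_tool)
fraction_found = len(found) / len(baseline_cns)
```

Here only the last baseline interval is recovered, which mirrors the situation in which an algorithm finds a conserved cluster but misses the short intervening CNSs.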

Figure 5. Detecting annotation errors. (a) Alignment of a homeologous gene pair in Arabidopsis with an annotation error using BLASTN. The regions analyzed included 2500 nucleotides of the 5′ and 3′ regions of genes At1g07300 and At2g (b) Alignment of two syntenic intragenomic regions from (a) to the syntenic region of an outgroup (Vitis vinifera) using BLASTN. Here it is evident that the Arabidopsis gene model for At2g29640 probably encompasses two genes, the 5′ section of which has been lost from the other Arabidopsis syntenic region.

Where is the noise?

BLASTN reduces false-positive noise in its alignments (i.e. HSPs, 'blast hits') by using the concept of an expect value (e-value). An in-depth discussion of the e-value calculation in BLAST is beyond the scope of this review, so please see html for full details. To facilitate CNS research, Thomas et al. (2007; and the Arabidopsis CNS website, cnr.berkeley.edu/atcns) devised a heuristic method to efficiently filter noise from genespaces of various lengths. These workers (from this laboratory) added an identical sequence of known length (called a spike sequence) to the 3′ end of the compared sequences. Using BLASTN to generate HSPs, they identified the HSP containing the spike sequence and removed all other HSPs of greater e-value. They found that a 15-bp spike sequence eliminated most of the noise from their analyses. Figure 2 shows a short syntenic region within Arabidopsis subjected to various levels of noise filtration using spike sequences of various lengths. Note that the filter with a 15-bp spike sequence eliminates all noise from the analysis (Figure 2c,d), leaving four CNSs. We leave it to you to try various other spike sequence lengths and gap/mismatch penalties using the links in Table 3.
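The spike-sequence heuristic described above can be sketched in a few lines. This is a hedged illustration rather than the code of Thomas et al. (2007): the HSP records are plain dictionaries with hypothetical fields, the example e-values are invented, and two choices are assumptions made here (using the least significant HSP when several cover the spike, and dropping the artificial spike HSP from the output).

```python
def add_spike(sequence, spike):
    """Append the spike to the 3' end of a sequence (done for both sequences)."""
    return sequence + spike


def spike_filter(hsps, spike_start, spike_end):
    """Keep HSPs at least as significant as the HSP containing the spike.

    hsps: list of dicts with 'start' and 'end' (half-open coordinates on
    the spiked sequence) and 'evalue'.
    """
    spike_hits = [h for h in hsps
                  if h["start"] <= spike_start and h["end"] >= spike_end]
    if not spike_hits:
        raise ValueError("no HSP contains the spike sequence")
    # Assumption: if several HSPs cover the spike, use the least significant.
    cutoff = max(h["evalue"] for h in spike_hits)
    # Assumption: the artificial spike HSP itself is not reported.
    return [h for h in hsps if h not in spike_hits and h["evalue"] <= cutoff]


# Hypothetical 20-bp sequence with a 15-bp spike appended at positions 20-35.
spiked = add_spike("ACGTTGCAAGGCTTAACCGG", "ACGTACGTACGTACG")
hsps = [
    {"start": 2, "end": 19, "evalue": 1e-4},    # more significant than the spike
    {"start": 5, "end": 14, "evalue": 3.0},     # less significant: noise
    {"start": 20, "end": 35, "evalue": 0.05},   # the HSP containing the spike
]
kept = spike_filter(hsps, spike_start=20, spike_end=35)
```

Changing the spike length changes the e-value of the spike HSP and therefore the cut-off, which is the parameter varied across the panels of Figure 2.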
Subfunctionalization of CNSs

Now that we can detect CNSs in homeologous genespaces, we will extend this to include an outgroup sequence for the purpose of identifying CNSs that are shared or fractionated/subfunctionalized. Figure 3 shows BLASTN comparisons with a 15-bp spike sequence of two homeologous maize genes (liguleless-like transcription factors retained from a tetraploidy that happened approximately 15 Ma) to a rice ortholog. In this example, each maize gene has several CNSs in common with rice that are not shared in the homeologous genespace (CNSs highlighted with purple ovals), which is evidence for subfunctionalization of the non-coding sequence of the maize genes.

Synteny demonstration

Figure 4(a,b) visualizes synteny between two intragenomic regions of Arabidopsis using BLASTN and BLASTZ, respectively. In both analyses, there are six pairs of genes that share a high degree of sequence similarity (blue double arrows) and are collinear, demonstrating synteny. However, the results from BLASTZ are easier to interpret visually. There are several genes in each region that do not have a corresponding homeolog (purple ovals), which we assume is the result of fractionation. To demonstrate fractionation, comparison with an outgroup sequence is necessary. For understanding intragenomic fractionation, such an outgroup would ideally have diverged before the intragenomic duplication event and not undergone a duplication event of its own. Figure 4(c) shows an example of this using Vitis vinifera (grape; Jaillon et al., 2007). In this example, although the intragenomic regions share a subset of their gene content, the unannotated outgroup contains the majority of the gene content and evidences fractionation. In addition, the outgroup comparison allows us to infer the pre-duplication ancestral state of the intragenomic syntenic regions, and track which genes have been preserved as singlets or retained as duplicates.

[Figure 6 panels: (a) BLASTZ, showing (++) HSPs and (+-) HSPs; (b) Shuffle-Lagan, showing the inversion boundaries.]

Figure 6. Detecting inversions. Sequence comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis with an inversion in one region. The upper region is from chromosome 1 identified by gene At1g02690, and the lower region is from chromosome 4 identified by gene At4g (a) BLASTZ. (b) Shuffle-Lagan. Blue arrows highlight homeologs, red arrows highlight homeologs in an inverted chromosomal region and the orange arrow represents a putative non-annotated gene on chromosome 1. Both algorithms identified the inversion event.

Expect errors in genomic sequences and annotation

If you examine the homeologous gene pair At1g07300 and At2g29640 from the previous example (Figure 4b, yellow exons in gene models), you will notice that these sequences have been assigned very different exon structures with respect to one another.
Looking at their shared sequence similarity, you will notice that the BLASTZ HSP covers and extends beyond the 5′ end of At1g07300, and partially covers the gene model of At2g. This difference may indicate that the genes are evolving in a unique fashion or that an annotation error was made. In either case, this pair of homeologs needs closer inspection. Although Arabidopsis gene models are certainly the best current models in plants, many Arabidopsis gene models are incorrect (Thomas et al., 2007). Figure 5(a) shows a pair-wise analysis of the two Arabidopsis regions using BLASTN. There are two clusters of HSPs: one set (HSPs 1 and 2) covers the entire coding region of At1g07300 and the 3′ exon of At2g29640, and the other set (HSPs 3, 4, 5, 6 and 7) covers the 5′ region of At1g07300 and the intronic region of At2g29640, with one HSP overlapping a middle coding exon. The general lack of congruence between the gene models for these homeologs, together with the odd placement of their sequence similarity, suggests an annotation error. Their annotations at TAIR show full-length cDNA support for At1g07300 and none for At2g. This implies that the gene model for At1g07300 is correct and that At2g29640 has an annotation error.

Figure 7. Detecting local duplications. (a) BLASTZ comparison of a genomic region with itself (chromosome 1) identifies a local duplicate (red double arrows) consisting of genes At1g07440 and At1g. (b) BLASTZ comparison of a syntenic region derived from the most recent genome duplication event in Arabidopsis. The upper region is from chromosome 1 and the lower region is from chromosome 2, identified by genes At1g07440 and At2g29300, respectively. Blue numbered boxes mark regions of sequence similarity identified by BLASTZ. Blue arrows highlight syntenic paralogous gene pairs. Regions highlighted in red denote the expansion or contraction of a gene family.

For further analysis of annotation errors uncovered through comparison of syntenic regions, an outgroup sequence is needed. Figure 5(b) shows the two intragenomic regions with an annotation error compared with an outgroup sequence (grape, V. vinifera). Here, we can see that both At1g07300 and At2g29640 have some 3′ sequence similarity to the outgroup. However, At2g29640 also has 5′ sequence similarity to the outgroup that is not present in the 5′ region of At1g. Also, the 5′ cluster of Arabidopsis HSPs in the non-coding sequence is not present in the outgroup. This suggests that the HSPs in the 5′ cluster are CNSs and that At2g29640 may represent two genes, one of which has been retained in the syntenic Arabidopsis genomic region, and one of which has been fractionated. In addition, you will notice an HSP stack (HSPs 5–8) in the comparison between At1g07300 and V. vinifera. This results from a simple sequence repeat (in this case a GAGA repeat) and the way in which BLAST identifies regions of sequence similarity.
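The HSP stack described above appears where a simple sequence repeat aligns with itself at many offsets. As a rough illustration only (this is not part of BLAST or of the tools reviewed here), the following Python sketch flags tandem runs of a short motif so that HSPs overlapping them can be treated with suspicion. The function names, the 2-bp unit length, the six-unit threshold and the toy sequence are all assumptions chosen for the example.

```python
import re


def find_simple_repeats(seq, unit_len=2, min_units=6):
    """Return (start, end, unit) for each tandem run of a short motif.

    Hypothetical helper: a run is a unit of `unit_len` bases repeated at
    least `min_units` times in a row. Coordinates are 0-based, end-exclusive.
    """
    # One captured unit followed by at least (min_units - 1) further copies.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_units - 1))
    return [(m.start(), m.end(), m.group(1)) for m in pattern.finditer(seq.upper())]


def hsp_overlaps_repeat(hsp_start, hsp_end, repeats):
    """True if the half-open interval [hsp_start, hsp_end) overlaps any repeat."""
    return any(hsp_start < r_end and r_start < hsp_end
               for r_start, r_end, _unit in repeats)


# Toy region (invented for illustration): 10 bp, a (GA)x10 run, then 10 bp.
region = "ATGCCTTAGC" + "GA" * 10 + "TTCAGGCATC"
repeats = find_simple_repeats(region)
print(repeats)                               # [(10, 30, 'GA')]
print(hsp_overlaps_repeat(25, 60, repeats))  # True: HSP starts inside the run
print(hsp_overlaps_repeat(0, 10, repeats))   # False: HSP ends where the run begins
```

An HSP that falls largely inside such a run is more likely to reflect low-complexity sequence than a conserved non-coding sequence, so it is worth checking before it is counted as a CNS.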
Inversions happen frequently, and break collinearity

Figure 6 compares two alignment tools, BLASTZ and Shuffle-Lagan, for their ability to identify an inversion within a syntenic region. Although both algorithms were able to identify an inversion containing at least four genes, as well as a putatively missed gene in one region, identifying regions of similarity is easier using the GELO visualization for BLASTZ.

Local duplications are very common and can greatly clutter alignment graphics

Duplications of two general types are shown in Figure 7 using BLASTZ. Figure 7(a) shows a region with a local duplication compared with itself, and Figure 7(b) shows a region with a local duplication compared with its syntenic region containing 12 local duplicates (two of which are pseudogenes). Notice that HSP1 in Figure 7(a) covers nearly the entire sequence. This is the result of comparing a genomic region against itself. Also, notice the HSP stacks in Figure 7(b) that arise when one gene is present in many copies in the other region. Although the HSP numbers overlap and are difficult to interpret, which is a limitation of this type of visualization, the expansion of this gene family by local duplication is apparent.

Conclusion

Looking forward

Plant biologists face an exciting future. There will be about a dozen plant genomes completed over the next few years, bringing opportunities to characterize new patterns of similarity and change in the structure and content of plant genomes. A challenge will be to associate phenotypes with specific patterns of sequence conservation. For example, many sequence motifs identified in CNSs are under active selection, but we know little else about their function, either biochemically or phenotypically. Although we know that there are different selective environments for genomes arising from polyploidy versus speciation, we do not yet understand the evolutionary constraints and consequences. However, we know that after polyploidy, genes are retained or lost based on their family type and their ancestral genomic region. Apart from knowing that gene dosage is primarily important for retention, and that subfunctionalization is particularly important after retention (Freeling, 2007), we do not understand exactly how bias of gene content occurs as a consequence of duplications. Comparative genomics is a young and vibrant field, and is especially so for plants because plant genespace is relatively less complex than that of mammals, and because tetraploidies offer many advantages for analysis. At the core of this enterprise are DNA alignment and visualization tools, some of which are reviewed here.

References

Adams, K.L. and Wendel, J.F. (2005) Polyploidy and genome evolution in plants. Curr. Opin. Plant Biol. 8,
Altschul, S.F., Gish, W., Miller, W., Myers, E.W. and Lipman, D.J. (1990) Basic local alignment search tool. J. Mol. Biol. 215,
Birchler, J.A., Riddle, N.C., Auger, D.L. and Veitia, R.A. (2005) Dosage balance in gene regulation: biological implications. Trends Genet. 21,
Bossolini, E., Wicker, T., Knobel, P.A.
and Keller, B. (2007) Comparison of orthologous loci from small grass genomes Brachypodium and rice: implications for wheat genomics and grass genome annotation. Plant J. 49,
Bowers, J.E., Chapman, B.A., Rong, J. and Paterson, A.H. (2003) Unravelling angiosperm genome evolution by phylogenetic analysis of chromosomal duplication events. Nature, 422,
Bray, N., Dubchak, I. and Pachter, L. (2003) AVID: a global alignment program. Genome Res. 13,
Brudno, M. and Morgenstern, B. (2002) Fast and sensitive alignment of large genomic sequences. Proceedings of the IEEE Computer Society Bioinformatics Conference (CSB). IEEE Computer Society Press.
Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., NISC Comparative Sequencing Program, Green, E.D., Sidow, A. and Batzoglou, S. (2003a) LAGAN and Multi-LAGAN: efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13,
Brudno, M., Malde, S., Poliakov, A., Do, C.B., Couronne, O., Dubchak, I. and Batzoglou, S. (2003b) Glocal alignment: finding rearrangements during alignment. Bioinformatics, 19(Suppl. 1), I54–I62.
Cooper, G.M., Singaravelu, S.A.G. and Sidow, A. (2004) ABC: software for interactive browsing of genomic multiple sequence alignment data. BMC Bioinformatics, 5, 192.
De Bodt, S., Maere, S. and Van de Peer, Y. (2005) Genome duplication and the origin of angiosperms. Trends Ecol. Evol. 20,
Force, A., Lynch, M., Pickett, F.B., Amores, A., Yan, Y.L. and Postlethwait, J. (1999) Preservation of duplicate genes by complementary degenerative mutations. Genetics, 151,
Freeling, M. (2007) The evolutionary position of subfunctionalization, downgraded. Genome Dyn., in press.
Freeling, M., Rapaka, L., Lyons, E., Pedersen, B. and Thomas, B.C. (2007) G-boxes, bigfoot genes, and environmental response: characterization of intragenomic conserved noncoding sequences in Arabidopsis. Plant Cell, 19,
Freeling, M. and Thomas, B.C. (2006) Gene-balanced duplications, like tetraploidy, provide predictable drive to increase morphological complexity. Genome Res. 16,
Gao, L.Z. and Innan, H. (2004) Very low gene duplication rate in the yeast genome. Science, 306,
Goode, D.K., Snell, P., Smith, S.F., Cooke, J.E. and Elgar, G. (2005) Highly conserved regulatory elements around the SHH gene may contribute to the maintenance of conserved synteny across human chromosome 7q36.3. Genomics, 86,
Guo, H. and Moose, S.P. (2003) Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell, 15,
Haberer, G., Hindemitt, T., Meyers, B.C. and Mayer, K.F. (2004) Transcriptional similarities, dissimilarities and conservation of cis-acting elements in duplicated genes in Arabidopsis. Plant Physiol. 136,
Hardison, R.C. (2000) Conserved noncoding sequences are reliable guides to regulatory elements. Trends Genet. 16,
Hardison, R.C. (2003) Comparative genomics. PLoS Biol. 1, E58.
Inada, D.C., Bashir, A., Lee, C., Thomas, B.C., Ko, C., Goff, S.A. and Freeling, M. (2003) Conserved noncoding sequences in the grasses. Genome Res. 13,
Jaillon, O., Aury, J., Noel, B. et al. (2007) The grapevine genome sequence suggests ancestral hexaploidization in major angiosperm phyla. Nature, 449,
Kaplinsky, N.J., Braun, D.M., Penterman, J., Goff, S.A. and Freeling, M. (2002) Utility and distribution of conserved noncoding sequences in the grasses. Proc. Natl Acad. Sci. USA, 99,
Koonin, E.V. (2005) Orthologs, paralogs, and evolutionary genomics. Annu. Rev. Genet. 39,
Langham, R.J., Walsh, J., Dunn, M., Ko, C., Goff, S.A. and Freeling, M. (2004) Genomic duplication, fractionation and the origin of regulatory novelty. Genetics, 166,
Loots, G.G., Locksley, R.M., Blankespoor, C.M., Wang, Z.E., Miller, W., Rubin, E.M. and Frazer, K.A. (2000) Identification of a coordinate regulator of interleukins 4, 13, and 5 by cross-species sequence comparisons. Science, 288,
Ludwig, M.Z., Palsson, A., Alekseeva, E., Bergman, C.M., Nathan, J. and Kreitman, M. (2005) Functional evolution of a cis-regulatory module. PLoS Biol. 3, e93.
Lynch, M. and Force, A. (2000) The probability of duplicate gene preservation by subfunctionalization. Genetics, 154,
Mayor, C., Brudno, M., Schwartz, J.R., Poliakov, A., Rubin, E.M., Frazer, K.A., Pachter, L.S. and Dubchak, I. (2000) VISTA: visualizing global DNA sequence alignments of arbitrary length. Bioinformatics, 16,
Mayr, E. (1993) What was the Evolutionary Synthesis? Trends Ecol. Evol. 8,
Morgenstern, B. (1999) DIALIGN 2: improvement of the segment-to-segment approach to multiple sequence alignment. Bioinformatics, 15,
Morgenstern, B., Frech, K., Dress, A. and Werner, T. (1998) DIALIGN: finding local similarities by multiple sequence alignment. Bioinformatics, 14,
Moses, A.M., Pollard, D.A., Nix, D.A., Iyer, V.N., Li, X.Y., Biggin, M.D. and Eisen, M.B. (2006) Large-scale turnover of functional transcription factor binding sites in Drosophila. PLoS Comput. Biol. 2, e130.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48,
Ovcharenko, I., Loots, G.G., Giardine, B.M., Hou, M., Ma, J., Hardison, R.C., Stubbs, L. and Miller, W. (2005a) Mulan: multiple-sequence local alignment and visualization for studying function and evolution. Genome Res. 15,
Ovcharenko, I., Loots, G.G., Nobrega, M.A., Hardison, R.C., Miller, W. and Stubbs, L. (2005b) Evolution and functional classification of vertebrate gene deserts. Genome Res. 15,
Paterson, A.H., Bowers, J.E., Van de Peer, Y. and Vandepoele, K. (2005) Ancient duplication of cereal genomes. New Phytol. 165,
Pearson, H. (2006) Genetics: what is a gene? Nature, 441,
Pohler, D., Werner, N., Steinkamp, R. and Morgenstern, B. (2005) Multiple alignment of genomic sequences using CHAOS, DIALIGN, and ABC. Nucleic Acids Res. 33,
Pollard, D., Bergman, C., Stoye, J., Celniker, S. and Eisen, M. (2006) Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics, 5, 6.
Schwartz, S., Zhang, Z., Frazer, K.A., Smit, A., Riemer, C., Bouck, J., Gibbs, R., Hardison, R.C. and Miller, W. (2000) PipMaker: a Web server for aligning two genomic DNA sequences. Genome Res. 10,
Schwartz, S., Kent, W.J., Smit, A., Zhang, Z., Baertsch, R., Hardison, R.C., Haussler, D. and Miller, W. (2003) Human-mouse alignments with BLASTZ. Genome Res. 13,
Siepel, A., Bejerano, G., Pedersen, J.S. et al. (2005) Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes. Genome Res. 15,
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol. 147,
Tatusova, T.A. and Madden, T.L. (1999) BLAST 2 Sequences, a new tool for comparing protein and nucleotide sequences. FEMS Microbiol. Lett. 174,
Thomas, B.C., Pedersen, B. and Freeling, M. (2006) Following tetraploidy in an Arabidopsis ancestor, genes were removed preferentially from one homeolog leaving clusters enriched in dose-sensitive genes. Genome Res. 16,
Thomas, B.C., Rapaka, L., Lyons, E., Pedersen, B. and Freeling, M. (2007) Intragenomic conserved noncoding sequences in Arabidopsis. Proc. Natl Acad. Sci. USA, 104,
Tuskan, G.A., Difazio, S., Jansson, S. et al. (2006) The genome of black cottonwood, Populus trichocarpa (Torr. & Gray). Science, 313,
Wang, J., Tian, L., Lee, H.S. et al. (2006) Genomewide nonadditive gene regulation in Arabidopsis allotetraploids. Genetics, 172,
Woolfe, A., Goodson, M., Goode, D.K. et al. (2005) Highly conserved non-coding sequences are associated with vertebrate development. PLoS Biol. 3, e7.

Multiple Alignment of Genomic Sequences

Multiple Alignment of Genomic Sequences Ross Metzger June 4, 2004 Biochemistry 218 Multiple Alignment of Genomic Sequences Genomic sequence is currently available from ENTREZ for more than 40 eukaryotic and 157 prokaryotic organisms. As part

More information

Handling Rearrangements in DNA Sequence Alignment

Handling Rearrangements in DNA Sequence Alignment Handling Rearrangements in DNA Sequence Alignment Maneesh Bhand 12/5/10 1 Introduction Sequence alignment is one of the core problems of bioinformatics, with a broad range of applications such as genome

More information

BLAST. Varieties of BLAST

BLAST. Varieties of BLAST BLAST Basic Local Alignment Search Tool (1990) Altschul, Gish, Miller, Myers, & Lipman Uses short-cuts or heuristics to improve search speed Like speed-reading, does not examine every nucleotide of database

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Contact 1 University of California, Davis, 2 Lawrence Berkeley National Laboratory, 3 Stanford University * Corresponding authors

Contact 1 University of California, Davis, 2 Lawrence Berkeley National Laboratory, 3 Stanford University * Corresponding authors Phylo-VISTA: Interactive Visualization of Multiple DNA Sequence Alignments Nameeta Shah 1,*, Olivier Couronne 2,*, Len A. Pennacchio 2, Michael Brudno 3, Serafim Batzoglou 3, E. Wes Bethel 2, Edward M.

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Homolog. Orthologue. Comparative Genomics. Paralog. What is Comparative Genomics. What is Comparative Genomics

Homolog. Orthologue. Comparative Genomics. Paralog. What is Comparative Genomics. What is Comparative Genomics Orthologue Orthologs are genes in different species that evolved from a common ancestral gene by speciation. Normally, orthologs retain the same function in the course of evolution. Identification of orthologs

More information

Comparative genomics. Lucy Skrabanek ICB, WMC 6 May 2008

Comparative genomics. Lucy Skrabanek ICB, WMC 6 May 2008 Comparative genomics Lucy Skrabanek ICB, WMC 6 May 2008 What does it encompass? Genome conservation transfer knowledge gained from model organisms to non-model organisms Genome evolution understand how

More information

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010

BLAST Database Searching. BME 110: CompBio Tools Todd Lowe April 8, 2010 BLAST Database Searching BME 110: CompBio Tools Todd Lowe April 8, 2010 Admin Reading: Read chapter 7, and the NCBI Blast Guide and tutorial http://www.ncbi.nlm.nih.gov/blast/why.shtml Read Chapter 8 for

More information

Comparative genomics: Overview & Tools + MUMmer algorithm

Comparative genomics: Overview & Tools + MUMmer algorithm Comparative genomics: Overview & Tools + MUMmer algorithm Urmila Kulkarni-Kale Bioinformatics Centre University of Pune, Pune 411 007. urmila@bioinfo.ernet.in Genome sequence: Fact file 1995: The first

More information

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison 10-810: Advanced Algorithms and Models for Computational Biology microrna and Whole Genome Comparison Central Dogma: 90s Transcription factors DNA transcription mrna translation Proteins Central Dogma:

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Multiple Genome Alignment by Clustering Pairwise Matches

Multiple Genome Alignment by Clustering Pairwise Matches Multiple Genome Alignment by Clustering Pairwise Matches Jeong-Hyeon Choi 1,3, Kwangmin Choi 1, Hwan-Gue Cho 3, and Sun Kim 1,2 1 School of Informatics, Indiana University, IN 47408, USA, {jeochoi,kwchoi,sunkim}@bio.informatics.indiana.edu

More information

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16 Genome Evolution Outline 1. What: Patterns of Genome Evolution Carol Eunmi Lee Evolution 410 University of Wisconsin 2. Why? Evolution of Genome Complexity and the interaction between Natural Selection

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Supplemental Information for Pramila et al. Periodic Normal Mixture Model (PNM)

Supplemental Information for Pramila et al. Periodic Normal Mixture Model (PNM) Supplemental Information for Pramila et al. Periodic Normal Mixture Model (PNM) The data sets alpha30 and alpha38 were analyzed with PNM (Lu et al. 2004). The first two time points were deleted to alleviate

More information

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18 Genome Evolution Outline 1. What: Patterns of Genome Evolution Carol Eunmi Lee Evolution 410 University of Wisconsin 2. Why? Evolution of Genome Complexity and the interaction between Natural Selection

More information

Lineage specific conserved noncoding sequences in plants

Lineage specific conserved noncoding sequences in plants Lineage specific conserved noncoding sequences in plants Nilmini Hettiarachchi Department of Genetics, SOKENDAI National Institute of Genetics, Mishima, Japan 20 th June 2014 Conserved Noncoding Sequences

More information

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are:

Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Comparative genomics and proteomics Species available Ensembl focuses on metazoan (animal) genomes. The genomes currently available at the Ensembl site are: Vertebrates: human, chimpanzee, mouse, rat,

More information

Sabarinath Subramaniam. A dissertation submitted in partial satisfaction of the. requirements for the degree of. Doctor of Philosophy.

Sabarinath Subramaniam. A dissertation submitted in partial satisfaction of the. requirements for the degree of. Doctor of Philosophy. Patterns of computed conserved noncoding sequence loss following the paleopolyploidies in the maize and Brassica lineages and their functional consequences By Sabarinath Subramaniam A dissertation submitted

More information

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Supplementary Note S2 Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata. Phylogenetic trees reconstructed by a variety of methods from either single-copy orthologous loci (Class

More information

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons

Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Effective Cluster-Based Seed Design for Cross-Species Sequence Comparisons Leming Zhou and Liliana Florea 1 Methods Supplementary Materials 1.1 Cluster-based seed design 1. Determine Homologous Genes.

More information

Bio 1B Lecture Outline (please print and bring along) Fall, 2007

Bio 1B Lecture Outline (please print and bring along) Fall, 2007 Bio 1B Lecture Outline (please print and bring along) Fall, 2007 B.D. Mishler, Dept. of Integrative Biology 2-6810, bmishler@berkeley.edu Evolution lecture #5 -- Molecular genetics and molecular evolution

More information

Genomes and Their Evolution

Genomes and Their Evolution Chapter 21 Genomes and Their Evolution PowerPoint Lecture Presentations for Biology Eighth Edition Neil Campbell and Jane Reece Lectures by Chris Romero, updated by Erin Barley with contributions from

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on:

17 Non-collinear alignment Motivation A B C A B C A B C A B C D A C. This exposition is based on: 17 Non-collinear alignment This exposition is based on: 1. Darling, A.E., Mau, B., Perna, N.T. (2010) progressivemauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One 5(6):e11147.

More information

Intro Gene regulation Synteny The End. Today. Gene regulation Synteny Good bye!

Intro Gene regulation Synteny The End. Today. Gene regulation Synteny Good bye! Today Gene regulation Synteny Good bye! Gene regulation What governs gene transcription? Genes active under different circumstances. Gene regulation What governs gene transcription? Genes active under

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

TE content correlates positively with genome size

TE content correlates positively with genome size TE content correlates positively with genome size Mb 3000 Genomic DNA 2500 2000 1500 1000 TE DNA Protein-coding DNA 500 0 Feschotte & Pritham 2006 Transposable elements. Variation in gene numbers cannot

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

CHAPTERS 24-25: Evidence for Evolution and Phylogeny CHAPTERS 24-25: Evidence for Evolution and Phylogeny 1. For each of the following, indicate how it is used as evidence of evolution by natural selection or shown as an evolutionary trend: a. Paleontology

More information

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Research Proposal Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family. Name: Minjal Pancholi Howard University Washington, DC. June 19, 2009 Research

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES

NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES Conery, J.S. and Lynch, M. Nucleotide substitutions and evolution of duplicate genes. Pacific Symposium on Biocomputing 6:167-178 (2001). NUCLEOTIDE SUBSTITUTIONS AND THE EVOLUTION OF DUPLICATE GENES JOHN

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Genome-wide analysis of the MYB transcription factor superfamily in soybean

Genome-wide analysis of the MYB transcription factor superfamily in soybean Du et al. BMC Plant Biology 2012, 12:106 RESEARCH ARTICLE Open Access Genome-wide analysis of the MYB transcription factor superfamily in soybean Hai Du 1,2,3, Si-Si Yang 1,2, Zhe Liang 4, Bo-Run Feng

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information
