Information Theoretic Distance Measures in Phylogenomics

Pavol Hanus, Janis Dingel, Juergen Zech, Joachim Hagenauer and Jakob C. Mueller

Institute for Communications Engineering, Technical University, 829 Munich, Germany
Institute of Medical Statistics and Epidemiology, Technical University, Ismaninger Str. 22, Munich, Germany
Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, 8235 Starnberg, Germany

Abstract

A variety of distance measures has been developed in information theory and has proven useful in digital information systems. Because the information of a living organism is stored digitally on the information carrier DNA, it seems natural to apply these methods to genome analysis. We present two applications to genetics. First, a compression based distance measure can be used to compute pairwise distances between genomic sequences of unequal lengths and thus to recognize the content of a DNA region. Second, the Kullback-Leibler distance serves as the basis for estimating evolutionary conservation across the genomes of different species in order to identify regions of potentially important functionality. Moreover, we show that conclusions can be drawn about the biological properties of the sequences analyzed in this way.

I. INTRODUCTION

DNA is the primary carrier of an organism's genetic information. It corresponds to a digital signal, since it stores information in the form of a sequence of nucleotides from a quaternary alphabet (A, C, G, T). Recent advances in sequencing technology have led to a steady growth of genomic sequence data, which makes it reasonable to apply methods from information theory to investigate and model how genetic information is stored, processed and transmitted. Genomic DNA can be subdivided into three main types based on functionality. The first group consists of regions that code for a functional protein, classically defined as genic or gene regions.
Nowadays, the term gene is defined more broadly and also includes coding regions with alternative splice variants and RNA genes that express RNAs of different functions without subsequent translation [1]. Furthermore, in eukaryotic organisms protein coding regions (exons) are interrupted by the non-coding parts of genes (introns). The second group consists of regions that regulate the activity of genes. The so-called cis-regulatory elements (i.e. promoters, terminators, activators, repressors, enhancers and silencers) synchronize the start and termination of transcription of DNA into RNA and regulate the rates of transcription. These regulators are often located in the vicinity of genes. The third group of DNA sequences includes all regions that are currently not associated with any function or regulation of genes. Apart from some effects on the stability of chromosomal structures or within recombination processes, little is known about their function.

After the human genome was sequenced [2], the next challenge was to localize functional regions on the genome. Common strategies for finding protein coding genes are based on identifying parts of their products (mRNA or protein) or conserved regions on genomic DNA. Although the annotation of protein coding genes has made good progress, it is still not complete [3]. Analysis of the similarities between various homologous genes has shown that regions conserved between different species (CRs) are often correlated with certain functions. The discovery of a high number of conserved non-genic regions (CNGs) provides growing evidence for not yet localized functional regions within the non-genic parts of the genome [4].

In this paper we present two methods for genomic analysis using information theoretic measures. In Section II, compression distance measures based on mutual information are used to distinguish between different sequence types.
In Section III, we present a method based on the Kullback-Leibler distance that predicts potentially functional regions in the DNA. Using the results obtained with this method, we discuss different classes of CNGs from a biological point of view in Section IV. Section V concludes this article.

II. DNA CLASSIFICATION USING COMPRESSION DISTANCE MEASURES BASED ON MUTUAL INFORMATION

The idea of using compression for phylogenetic classification of whole genomes was first introduced in 2001 [5]. It rests on the expectation that a concatenation of two similar sequences compresses better than a concatenation of two dissimilar ones. In [6], compression distance measures based on mutual information were presented. The compared sequences are assumed to have been generated by different information sources, and mutual information is used as a similarity measure of the compared sources. Mutual information is a measure of the information common to both sources $S_i$, $S_j$. It can be transformed into a bounded distance through normalization by the maximum possible mutual information the two sources can share, resulting in

$$d(S_i, S_j) = 1 - \frac{I(S_i; S_j)}{\min(H(S_i), H(S_j))} = \frac{\min(H(S_i|S_j), H(S_j|S_i))}{\min(H(S_i), H(S_j))}. \tag{1}$$

According to Shannon's fundamental theorem on data compression, the entropy rate $H(S)$ can be approximated by the compression ratio achieved for the message $s$ generated by the source $S$,

$$H(S) \approx \frac{|\mathrm{comp}(s)|}{|s|}, \tag{2}$$

where $|\cdot|$ denotes the size in bits or symbols. In the following, universal lossless compression algorithms will be used. Such compressors gradually adjust their underlying general statistical model, which describes a whole class of sources, to the individual statistics of the particular message being compressed. For example, DNACompress [7] encodes the long approximate palindromes and repeats that often occur in genomic DNA. It is therefore particularly suited to compare sources of type DNA, since it compresses a concatenation of similar DNA sequences well, as opposed to a concatenation of dissimilar ones. Consequently, the conditional entropy $H(S_i|S_j)$ of two different sources $S_i$ and $S_j$ will be approximated by the compression ratio achieved for the message $s_i$ when the compressor's model is trained on the message $s_j$. The compressed size of the concatenated sequences, $|\mathrm{comp}(s_j, s_i)|$, can be used for this purpose:

$$H(S_i|S_j) \approx \frac{|\mathrm{comp}(s_j, s_i)| - |\mathrm{comp}(s_j)|}{|s_i|}. \tag{3}$$

Since the distance measure (1) turned out to be robust against inaccuracies of the compression based approximation, it is particularly suited for content recognition, where an unknown sequence is to be assigned to the closest sequence from a set of known sequences. To demonstrate the content recognition performance, we present the results for content recognition of non-genic regions (ng), exons (ex) and introns (in). As content sequences, the first 5, nucleotides (5kb) of concatenated sequences of each type were taken from the human chromosome 19 (c19). Sequences of different sizes of each type, taken from the beginning of chromosome 1 (c1), were used as unknown sequences.
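The approximations (2) and (3) and the distance (1) can be sketched in a few lines. This is only a minimal illustration under stated assumptions: it substitutes the general-purpose `zlib` compressor from the Python standard library for DNACompress (which models DNA repeats far better), it measures sizes in bytes, and the function names `comp_size`, `entropy_rate`, `cond_entropy_rate`, `distance` and `classify` are hypothetical and not taken from [6] or [7].

```python
import zlib


def comp_size(seq: str) -> int:
    """Compressed size |comp(s)| in bytes (zlib is a stand-in compressor)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def entropy_rate(s: str) -> float:
    """Approximation (2): H(S) ~ |comp(s)| / |s|."""
    return comp_size(s) / len(s)


def cond_entropy_rate(s_i: str, s_j: str) -> float:
    """Approximation (3): H(S_i|S_j) ~ (|comp(s_j, s_i)| - |comp(s_j)|) / |s_i|."""
    return (comp_size(s_j + s_i) - comp_size(s_j)) / len(s_i)


def distance(s_i: str, s_j: str) -> float:
    """Distance (1) in its second form: min of conditionals over min of entropies."""
    numerator = min(cond_entropy_rate(s_i, s_j), cond_entropy_rate(s_j, s_i))
    denominator = min(entropy_rate(s_i), entropy_rate(s_j))
    return numerator / denominator


def classify(unknown: str, known: dict) -> str:
    """Content recognition: name of the known sequence closest to `unknown`."""
    return min(known, key=lambda name: distance(unknown, known[name]))
```

Calling `classify(unknown, {"ng": ng_seq, "in": in_seq, "ex": ex_seq})` with user-supplied sequences returns the label with the smallest distance. Because the conditional entropy is estimated from a difference of compressed sizes, the resulting value is not guaranteed to stay within the range 0 to 1.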
Using DNACompress, all unknown sequences were recognized correctly, as shown in Table I, which lists the pairwise distances $d(S_j^C, S_i^U)$ between all known and unknown sequences. Some distances are greater than 1 due to the concatenation in the compression based approximation of the conditional entropy (3), which leads to high compression ratios if a dissimilar sequence is used for training. Similarly, compression distance measures based on mutual information can be used to build phylogenetic trees of different species or of the human mitochondrial population.

TABLE I. CONTENT RECOGNITION (NG-EX-IN). BEST SIGNALS MAXIMAL COMPRESSION RATE.
Unknown $S_j^U$ \ Content $S_i^C$: c19ng-5kb, c19in-5kb, c19ex-5kb
c1ng-3kb: .4-best
c1ng-13kb: .65-best
c1in-3kb: best 1.1
c1in-13kb: 1. .5-best 1.7
c1ex-3kb: best
c1ex-13kb: best

III. PREDICTION OF FUNCTIONAL ELEMENTS IN NON-CODING DNA REGIONS

A. Comparative Genomics

An important task in biology is to find putative functional conserved elements. Evolution of vertebrate organisms is a multi-step process of successive germ cell maturation and fusion steps. During germ cell maturation (meiosis), the genomic information must be copied and passed on to the daughter cells. Each germ cell carries one copy of the whole genome. Early approaches to model this process were limited to the use of information from one species. Today, with whole genomes of multiple species available, a comparative approach is often used to infer functional regions. During the process of evolution, the passed genetic information (DNA) is subjected to mutations that cause variations. Mutations that alter information coding for important functionality (e.g. genes) are likely to diminish the organism's capability to pass on its DNA. Thus, those elements within the genome that carry information for important basic functions are less likely to mutate successfully during evolution.
Consequently, by identifying conserved elements in the assembly of the genomes of several species, we find candidates that are very likely to be functional [4]. We will introduce a detection method which, in contrast to earlier approaches [8], [9], is independent of assumptions about neutral evolutionary rates and does not require a priori tuning parameters. We propose a definition of conservation that relies on the Kullback-Leibler distance to the well defined maximum possible conservation, which does not allow any mutations to occur [10].

B. Multiple Sequence Alignment

Fig. 1. A short sample of a Multiple Sequence Alignment of five species (human, rat, mouse, chicken, fugu) and their phylogenetic tree.

Single point mutations are not the only type of mutation changing the DNA. During evolution, large scale mutations can occur. Comparing different species, it is observed that severe rearrangements, large scale deletions, insertions and duplications have reshaped their genomes. As a result, only parts of the DNA can be regarded as having evolved from a common ancestor when considering distant species. Hence, as a preprocessing step for comparative genomic analysis, a computational procedure has to find these sequences and align them properly. This step is commonly referred to as Multiple Sequence Alignment (MSA) and is a vibrant research topic in bioinformatics [11]. In Figure 1, a multiple sequence alignment is visualized together with the phylogenetic tree it is based on. Mutations are marked by white spots, and gaps account for insertions and deletions. After the sequences have been aligned, the description of evolution can concentrate on small substitutions, insertions and deletions.

C. Modeling of Evolution

Mathematically, evolution is commonly described by a set of parameters $\psi_\rho = \{\tau_\rho, t_\rho, R_\rho, \theta_\rho\}$. Models of evolution are thoroughly discussed in [9]. As different sections of the ancestral genome evolved differently, the evolutionary process is not homogeneous, and different sections $\rho$ are described by different parameters. Theoretically, a different $\psi_i$ should be assumed for each position $i$ in the alignment. The phylogenetic relation among $n$ species is modeled by a binary tree with topology $\tau_\rho$, having $n$ leaves. The nodes of the tree are connected by branches, and the relative evolutionary distances between the nodes are described by $t_\rho$, where $t_\rho^{uv}$ denotes the distance between nodes $u$ and $v$ in section $\rho$. Different sections of the genome experience different mutation rates. The parameter $\theta_\rho$ accounts for this rate heterogeneity and is modeled as a scalar multiplying the relative distance $t_\rho$. Single mutations that occur during the transmission between two nodes of the tree are commonly modeled by a continuous time Markov process that is described by a rate matrix $R_\rho$.
The transition probability matrix between two nodes $u$ and $v$, giving the probabilities that a base is substituted by another, is then calculated as [9]

$$P_\rho^{uv} = e^{\theta_\rho t_\rho^{uv} R_\rho}. \tag{4}$$

Figure 2 shows a toy example of an evolutionary model describing the relation among 3 different species a, b, c for two different regions $\rho_1$ and $\rho_2$ of their genome.

Fig. 2. Example for the parametrization of evolution in two different sections of an alignment of 3 species: (a) section $\rho_1$, (b) section $\rho_2$.

We see that $\tau_{\rho_1} = \tau_{\rho_2}$ and $t_{\rho_1} = t_{\rho_2}$. Equal transition rates are assumed, expressed by $R_{\rho_1} = R_{\rho_2}$. Section $\rho_1$ seems to be under high selective pressure, as only a small number of mutations can be observed, which is modeled by $\theta_{\rho_1} = 0.5$. Section $\rho_2$ shows a higher mutation rate, and $\theta_{\rho_2} = 1$. This results in different transition probability matrices according to (4). Usually, the phylogenetic tree is assumed to be constant for all sections, $\tau_\rho = \tau \;\forall \rho$, as are the relative branch lengths, $t_\rho = t \;\forall \rho$. The rate matrix is often assumed to be constant for large sections of the multiple alignment, whereas the rate heterogeneity is often modeled by a stochastic process in which the $\theta_i$ are sampled from a gamma distribution for each position $i$ in the alignment [9].

D. A Distance Based Conservation Estimator

In a transmission model for evolution, which has parallels in a mobile communication link, we have the following situation: a single sequence $\{x_i\}$ (from a common ancestor) is transmitted over a multipath channel (evolution). Errors (mutations, erasures and insertions) may occur during transmission. At the receiver, we observe the receive vector sequence $\{y_i\}$ (the realizations of the ancestral sequence as we observe it today in the genomes of the species). The channel is characterized by the transition probabilities $p_y(y_i | x_i; \psi_i)$, conditional on $x_i$ and parameterized by $\psi_i$. These can be calculated efficiently using Felsenstein's algorithm [9].
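As a hedged illustration of (4) and of the channel view, the following sketch computes the probability of one alignment column for a star-shaped tree of three species joined at a single ancestral node, as in the toy example of Fig. 2. Several choices here are assumptions made only for the example: a Jukes-Cantor rate matrix (for which the matrix exponential in (4) has a closed form), a uniform base distribution at the ancestor, and the branch lengths used in the usage note. The names `jc_transition` and `column_likelihood` are hypothetical. For a star tree, Felsenstein's algorithm reduces to the single sum over the ancestral base shown below.

```python
import math

BASES = "ACGT"


def jc_transition(theta: float, t: float, alpha: float = 1.0) -> dict:
    """P = exp(theta * t * R) for a Jukes-Cantor rate matrix R
    (off-diagonal entries alpha, diagonal entries -3 * alpha)."""
    decay = math.exp(-4.0 * alpha * theta * t)
    same = 0.25 + 0.75 * decay   # probability that the base is unchanged
    diff = 0.25 - 0.25 * decay   # probability of one specific substitution
    return {x: {y: (same if x == y else diff) for y in BASES} for x in BASES}


def column_likelihood(column, branch_lengths, theta, prior=None) -> float:
    """Probability of one alignment column on a star tree:
    sum over ancestral bases x of p(x) * prod_k P_k[x][y_k]."""
    if prior is None:
        prior = {x: 0.25 for x in BASES}  # assumed uniform ancestral distribution
    matrices = [jc_transition(theta, t) for t in branch_lengths]
    total = 0.0
    for x in BASES:
        term = prior[x]
        for P, y in zip(matrices, column):
            term *= P[x][y]
        total += term
    return total
```

For example, `column_likelihood("AAA", [0.1, 0.15, 0.2], theta=0.5)` is larger than the same call with `theta=1.0`, reflecting that a fully conserved column is more probable under a lower mutation rate. A general binary tree would require the recursive form of Felsenstein's algorithm rather than this single sum.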
A column in the alignment is distributed according to

$$p_y(y_i; \psi_i) = \sum_{x_i} p_y(y_i | x_i; \psi_i)\, p(x_i),$$

where $p(x_i)$, the distribution of bases in the ancestral genome, is assumed to be known [9]. From this point of view, estimating the conservation of a particular DNA region amounts to estimating how good the transmission channel was in this region. Instead of comparing the observation to a model of neutral evolution, we measure the distance to the maximum conservation. In a communications framework, the maximum conservation is equivalent to the case of noiseless transmission, i.e. the base $x_i$ is observed unchanged in all components of the receive vector $y_i$. In this situation, let the receive vector $y_i$ be distributed according to $p_y(y_i; \psi_0)$. In terms of biology, the natural pressure on a maximally conserved sequence is so high that not a single mutation is allowed to occur.

For the comparison with the maximum conservation case, we estimate the evolutionary model that maximizes the likelihood of an ensemble of received vectors. In a sliding window over the observed data (i.e. sequences of alignable DNA regions) $Y_i = [y_{i-\delta}, \ldots, y_{i+\delta}]$, with $\delta$ fixed, we determine the evolutionary model $\hat\psi_i$ that most likely led to the observed data. Assuming statistical independence among the columns of $Y_i$:

$$\hat\psi_i = \arg\max_{\psi_i} \sum_{j=i-\delta}^{i+\delta} \log p_y(y_j; \psi_i). \tag{5}$$

We calculate the probability mass function $p_y(y_i; \hat\psi_i)$ for the column $i$ in the middle of the sliding window. This distribution is parameterized by $\hat\psi_i$, and we compare the estimated distribution with the one corresponding to the maximum conservation process using the Kullback-Leibler distance

$$D\big(p_y(y_i; \hat\psi_i) \,\|\, p_y(y_i; \psi_0)\big). \tag{6}$$

In order to obtain a score value in the range of 0 to 1, where 1 indicates maximum conservation, we apply a sigmoid function, as used in neural networks, to transform the distance into the final conservation score (CS):

$$c_i = 1 - \tanh\Big(D\big(p_y(y_i; \hat\psi_i) \,\|\, p_y(y_i; \psi_0)\big)\Big).$$

The score $c_i$ is assigned to the column in the middle of the sliding window. The treatment of gaps in the likelihood function is a general problem in phylogenetics. As in earlier methods, gaps are treated as missing data in our approach, which causes the algorithm to consider only the subtree of species for which data is available.

E. Results

Fig. 3. Top: Comparison of scores indicating CRs (axis: Conservation Score (CS); annotation: highly conserved). Bottom: Visualization of the respective genomic data, a small section of an alignment of the genomes of human, mouse, rat, chicken and fugu.

Fig. 3 shows our estimation of conservation and the underlying genomic data. Our distance based score signal reflects the different degrees of conservation, as one can observe by comparing the course of the signal with the data. Our method was tested on synthetic data and compared to an established tool from Siepel and Haussler [9]. The results suggested that our method can discriminate more efficiently between the different degrees of conservation [10].

IV. CONSERVATION SCORE AND BIOLOGICAL PROPERTIES

In this section we cluster and analyse CRs longer than 4 bp identified with our method from a five species alignment [4]. A nucleotide with a conservation score (CS) smaller than the exon average value of 0.8 is defined as conserved. A maximum gap of 2 non-conserved nucleotides is tolerated and does not interrupt a CR.
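The transform from a Kullback-Leibler distance to a conservation score, and the grouping of conserved nucleotides into CRs with a tolerated gap, can be sketched as follows. This is an illustrative sketch, not the implementation used for the results: the function names are hypothetical, the Kullback-Leibler helper assumes two discrete distributions given as aligned lists with the second one strictly positive wherever the first is, and the gap and length thresholds are exposed as parameters with defaults equal to the values printed in the text. The per-nucleotide decision "conserved or not" is passed in as a list of flags, so the threshold comparison itself is left to the caller.

```python
import math


def kl_distance(p, q) -> float:
    """D(p || q) in nats; assumes q[k] > 0 wherever p[k] > 0."""
    return sum(pk * math.log(pk / qk) for pk, qk in zip(p, q) if pk > 0)


def conservation_score(distance: float) -> float:
    """Map a non-negative distance to a score: c = 1 - tanh(D)."""
    return 1.0 - math.tanh(distance)


def conserved_regions(flags, max_gap: int = 2, min_len: int = 4):
    """Group conserved positions into regions (start, end), end exclusive.

    A run of at most `max_gap` non-conserved positions does not interrupt
    a region; only regions longer than `min_len` positions are returned.
    """
    regions = []
    start = last = None
    for i, conserved in enumerate(flags):
        if not conserved:
            continue
        if start is None:
            start = i                        # open the first region
        elif i - last - 1 > max_gap:
            regions.append((start, last + 1))  # gap too long: close region
            start = i
        last = i
    if start is not None:
        regions.append((start, last + 1))
    return [(s, e) for s, e in regions if e - s > min_len]
```

A distance of zero maps to the score 1, and the score decreases monotonically as the distance grows. Regions start and end on a conserved nucleotide, so tolerated gaps only ever lie in the interior of a CR.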
We are mostly interested in functional modules that are still unknown; CRs with any known function are therefore excluded from the analysis. Our full set of CNGs was first screened with RepeatMasker, a program that searches for interspersed repeats and low complexity DNA sequences and masks nearly one half of the human genome [12]. All CRs located within regions annotated as known genes were removed [13]. To find relations between the CS of CNGs and putative functional modules, we must first show that the CS is associated with biological properties in general. CRs without any open reading frame (ORF) cannot produce proteins. Therefore 3423 CNGs lacking an ORF and 12688 CNGs with an ORF were selected from our full set of CNGs. The distributions of the empirical mean CS and of its standard deviation are compared for both groups in Fig. 4. The group of CNGs without any ORF is assumed to hold many cis-regulatory modules with short, interspersed but highly conserved protein binding sites. These characteristics are reflected in the increased CS and in a more heterogeneous distribution of sub-CRs within the CNG (i.e. an increased standard deviation of the CS) compared with CNGs with at least one ORF. The latter group of CNGs may contain CRs of undiscovered pseudogenes or antisense RNA genes descending from former genes. One would therefore expect relatively large and homogeneous conserved domains. The findings indicate that CNG subgroups with different biological properties can be discriminated by their CS distribution. This is an interesting observation, since clustering of candidate regions is a crucial step in the identification and localization of functional regions. To date, the biological annotation of sequenced genomes is neither complete nor properly standardized. The best gene-prediction programs are able to detect only 85% of all known exons [3].
Annotation of other functional or nonfunctional elements of the genome, such as pseudogenes or non-translated RNA genes, is even less complete. Little is also known about the exact localization of cis-regulatory elements [14]. Most publications searching for novel biological properties of DNA sequences are based on annotations or on statistical properties such as nucleotide frequency, known protein binding sites or sequence similarity. Information theoretic distance measures may be a useful new tool in this field. The key to using them properly, however, is to train them with sequences of known biological properties and to learn about the implications of the resulting patterns. Further properties that should be examined are the capability of DNA to form secondary structures and putative sites of chemical modification.

Fig. 4. Distribution of the mean CS and its standard deviation for CNGs with ORF (12688, 78.75%) and CNGs without ORF (3423, 21.25%).

V. CONCLUSION AND PERSPECTIVES

We presented a distance measure based on a trained compression algorithm that is able to classify the type (non-genic, introns, exons) of an unknown DNA sequence. Moreover, we developed a sensitive conservation score to estimate the general conservation of a certain DNA region without assuming neutral evolution rates. This may become a valuable tool for biologists to identify new functional regions. Additionally, this score may be very useful in identifying sequence modules that influence the conservation of other DNA regions. This could be the first step in finding evidence for an error correcting mechanism based on genomic DNA codes, as postulated by Battail [15]. A first experimental hint for this hypothesis was given by a recently published experiment on a reverse mutation ability in the plant Arabidopsis [16]. This plant is able to recover information that is not present in the exon of its parental genome but elsewhere in the genome of the previous generation. Another interesting topic of modern biology is the identification of regulatory non-coding RNAs [17]. One important characteristic of such ncRNAs is conserved secondary structure motifs, which could possibly be discriminated from common primary conserved sequences with our method.

ACKNOWLEDGMENT

This work was supported by the DFG projects no. HA 1358/1-1 and MU 1479/1-1 and the Bund der Freunde der TUM.

REFERENCES

[1] H. Pearson, "Genetics: what is a gene?" Nature, vol. 441, 2006.
[2] International Human Genome Sequencing Consortium, "Finishing the euchromatic sequence of the human genome," Nature, vol. 431, 2004.
[3] M. Brent, "Genome annotation past, present, and future: how to define an ORF at each locus," Genome Research, vol. 15, 2005.
[4] A. Siepel, G. Bejerano, and J. S. Pedersen, "Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes," Genome Research, vol. 15, no. 8, August 2005.
[5] M. Li, J. Badger, X. Chen, et al., "An information-based sequence distance and its application to whole mitochondrial genome phylogeny," Bioinformatics, vol. 17, 2001.
[6] Z. Dawy, J. Hagenauer, P. Hanus, et al., "Mutual information based distance measures for classification and content recognition with applications to genetics," Proc. of the ICC, 2005.
[7] X. Chen, M. Li, B. Ma, et al., "DNACompress: fast and effective DNA sequence compression," Bioinformatics, vol. 18, no. 12, 2002.
[8] E. Margulies, M. Blanchette, D. Haussler, et al., "Identification and characterization of multi-species conserved sequences," Genome Research, vol. 13, 2003.
[9] R. Nielsen, Statistical methods in molecular evolution, Springer Science+Business Media, Inc., 2005.
[10] P. Hanus, J. Dingel, J. Hagenauer, et al., "An alternative method for detecting conserved regions in multiple species," Proc. of the German Conference on Bioinformatics, p. 64, 2005.
[11] R. Durbin, S. Eddy, A. Krogh, et al., Biological sequence analysis - probabilistic models of proteins and nucleic acids, Cambridge University Press.
[12] A. F. A. Smit, R. Hubley, and P. Green, RepeatMasker Open.
[13] E. Birney, D. Andrews, M. Caccamo, et al., "Ensembl 2006," Nucleic Acids Research, January 2006.
[14] M. Blanchette, A. R. Bataille, X. Chen, et al., "Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression," Genome Research, vol. 16, 2006.
[15] G. Battail, "Information theory and error correcting codes in genetics and biological evolution," Introduction to Biosemiotics, Springer, November 2006.
[16] S. J. Lolle, J. L. Victor, J. M. Young, et al., "Genome-wide non-mendelian inheritance of extra-genomic information in Arabidopsis," Nature, vol. 434, March 2005.
[17] B.-J. Yoon and P. Vaidyanathan, "Computational identification and analysis of noncoding RNAs," IEEE Signal Processing Magazine, vol. 24, no. 1, January 2007.


More information

Dr. Amira A. AL-Hosary

Dr. Amira A. AL-Hosary Phylogenetic analysis Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic Basics: Biological

More information

Chapter 15 Active Reading Guide Regulation of Gene Expression

Chapter 15 Active Reading Guide Regulation of Gene Expression Name: AP Biology Mr. Croft Chapter 15 Active Reading Guide Regulation of Gene Expression The overview for Chapter 15 introduces the idea that while all cells of an organism have all genes in the genome,

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Small RNA in rice genome

Small RNA in rice genome Vol. 45 No. 5 SCIENCE IN CHINA (Series C) October 2002 Small RNA in rice genome WANG Kai ( 1, ZHU Xiaopeng ( 2, ZHONG Lan ( 1,3 & CHEN Runsheng ( 1,2 1. Beijing Genomics Institute/Center of Genomics and

More information

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p.110-114 Arrangement of information in DNA----- requirements for RNA Common arrangement of protein-coding genes in prokaryotes=

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Multiple Choice Review- Eukaryotic Gene Expression

Multiple Choice Review- Eukaryotic Gene Expression Multiple Choice Review- Eukaryotic Gene Expression 1. Which of the following is the Central Dogma of cell biology? a. DNA Nucleic Acid Protein Amino Acid b. Prokaryote Bacteria - Eukaryote c. Atom Molecule

More information

GCD3033:Cell Biology. Transcription

GCD3033:Cell Biology. Transcription Transcription Transcription: DNA to RNA A) production of complementary strand of DNA B) RNA types C) transcription start/stop signals D) Initiation of eukaryotic gene expression E) transcription factors

More information

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models 02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput

More information

Bioinformatics Chapter 1. Introduction

Bioinformatics Chapter 1. Introduction Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

What can sequences tell us?

What can sequences tell us? Bioinformatics What can sequences tell us? AGACCTGAGATAACCGATAC By themselves? Not a heck of a lot...* *Indeed, one of the key results learned from the Human Genome Project is that disease is much more

More information

Markov Chains and Hidden Markov Models. = stochastic, generative models

Markov Chains and Hidden Markov Models. = stochastic, generative models Markov Chains and Hidden Markov Models = stochastic, generative models (Drawing heavily from Durbin et al., Biological Sequence Analysis) BCH339N Systems Biology / Bioinformatics Spring 2016 Edward Marcotte,

More information

How much non-coding DNA do eukaryotes require?

How much non-coding DNA do eukaryotes require? How much non-coding DNA do eukaryotes require? Andrei Zinovyev UMR U900 Computational Systems Biology of Cancer Institute Curie/INSERM/Ecole de Mine Paritech Dr. Sebastian Ahnert Dr. Thomas Fink Bioinformatics

More information

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007

Understanding Science Through the Lens of Computation. Richard M. Karp Nov. 3, 2007 Understanding Science Through the Lens of Computation Richard M. Karp Nov. 3, 2007 The Computational Lens Exposes the computational nature of natural processes and provides a language for their description.

More information

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data

GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data GLOBEX Bioinformatics (Summer 2015) Genetic networks and gene expression data 1 Gene Networks Definition: A gene network is a set of molecular components, such as genes and proteins, and interactions between

More information

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00.

Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00. Transcription Regulation and Gene Expression in Eukaryotes FS08 Pharmacenter/Biocenter Auditorium 1 Wednesdays 16h15-18h00. Promoters and Enhancers Systematic discovery of transcriptional regulatory motifs

More information

A A A A B B1

A A A A B B1 LEARNING OBJECTIVES FOR EACH BIG IDEA WITH ASSOCIATED SCIENCE PRACTICES AND ESSENTIAL KNOWLEDGE Learning Objectives will be the target for AP Biology exam questions Learning Objectives Sci Prac Es Knowl

More information

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut University-Egypt Phylogenetic analysis Phylogenetic Basics: Biological

More information

Graph Alignment and Biological Networks

Graph Alignment and Biological Networks Graph Alignment and Biological Networks Johannes Berg http://www.uni-koeln.de/ berg Institute for Theoretical Physics University of Cologne Germany p.1/12 Networks in molecular biology New large-scale

More information

Computational approaches for functional genomics

Computational approaches for functional genomics Computational approaches for functional genomics Kalin Vetsigian October 31, 2001 The rapidly increasing number of completely sequenced genomes have stimulated the development of new methods for finding

More information

Computational methods for predicting protein-protein interactions

Computational methods for predicting protein-protein interactions Computational methods for predicting protein-protein interactions Tomi Peltola T-61.6070 Special course in bioinformatics I 3.4.2008 Outline Biological background Protein-protein interactions Computational

More information

Computational Identification of Evolutionarily Conserved Exons

Computational Identification of Evolutionarily Conserved Exons Computational Identification of Evolutionarily Conserved Exons Adam Siepel Center for Biomolecular Science and Engr. University of California Santa Cruz, CA 95064, USA acs@soe.ucsc.edu David Haussler Howard

More information

Eukaryotic vs. Prokaryotic genes

Eukaryotic vs. Prokaryotic genes BIO 5099: Molecular Biology for Computer Scientists (et al) Lecture 18: Eukaryotic genes http://compbio.uchsc.edu/hunter/bio5099 Larry.Hunter@uchsc.edu Eukaryotic vs. Prokaryotic genes Like in prokaryotes,

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally

More information

Supplementary Material

Supplementary Material Supplementary Material 1 Sequence Data and Multiple Alignments The five vertebrate, four insect, two worm, and seven yeast genomes used in the analysis are summarized in Table S1, and the four genome-wide

More information

Comparative Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Comparative Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Comparative Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following: using related genomes

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment Introduction to Bioinformatics online course : IBT Jonathan Kayondo Learning Objectives Understand

More information

Name: Class: Date: ID: A

Name: Class: Date: ID: A Class: _ Date: _ Ch 17 Practice test 1. A segment of DNA that stores genetic information is called a(n) a. amino acid. b. gene. c. protein. d. intron. 2. In which of the following processes does change

More information

Information in Biology

Information in Biology Lecture 3: Information in Biology Tsvi Tlusty, tsvi@unist.ac.kr Living information is carried by molecular channels Living systems I. Self-replicating information processors Environment II. III. Evolve

More information

TE content correlates positively with genome size

TE content correlates positively with genome size TE content correlates positively with genome size Mb 3000 Genomic DNA 2500 2000 1500 1000 TE DNA Protein-coding DNA 500 0 Feschotte & Pritham 2006 Transposable elements. Variation in gene numbers cannot

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Introduction to molecular biology. Mitesh Shrestha

Introduction to molecular biology. Mitesh Shrestha Introduction to molecular biology Mitesh Shrestha Molecular biology: definition Molecular biology is the study of molecular underpinnings of the process of replication, transcription and translation of

More information

Phylogenetic inference

Phylogenetic inference Phylogenetic inference Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, March 7 th 016 After this lecture, you can discuss (dis-) advantages of different information types

More information

Nature Genetics: doi: /ng Supplementary Figure 1. Icm/Dot secretion system region I in 41 Legionella species.

Nature Genetics: doi: /ng Supplementary Figure 1. Icm/Dot secretion system region I in 41 Legionella species. Supplementary Figure 1 Icm/Dot secretion system region I in 41 Legionella species. Homologs of the effector-coding gene lega15 (orange) were found within Icm/Dot region I in 13 Legionella species. In four

More information

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON

CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON PROKARYOTE GENES: E. COLI LAC OPERON CHAPTER 13 CHAPTER 13 PROKARYOTE GENES: E. COLI LAC OPERON Figure 1. Electron micrograph of growing E. coli. Some show the constriction at the location where daughter

More information

Biology. Biology. Slide 1 of 26. End Show. Copyright Pearson Prentice Hall

Biology. Biology. Slide 1 of 26. End Show. Copyright Pearson Prentice Hall Biology Biology 1 of 26 Fruit fly chromosome 12-5 Gene Regulation Mouse chromosomes Fruit fly embryo Mouse embryo Adult fruit fly Adult mouse 2 of 26 Gene Regulation: An Example Gene Regulation: An Example

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/8/16 Genome Evolution Outline 1. What: Patterns of Genome Evolution Carol Eunmi Lee Evolution 410 University of Wisconsin 2. Why? Evolution of Genome Complexity and the interaction between Natural Selection

More information

AP Curriculum Framework with Learning Objectives

AP Curriculum Framework with Learning Objectives Big Ideas Big Idea 1: The process of evolution drives the diversity and unity of life. AP Curriculum Framework with Learning Objectives Understanding 1.A: Change in the genetic makeup of a population over

More information

Curriculum Links. AQA GCE Biology. AS level

Curriculum Links. AQA GCE Biology. AS level Curriculum Links AQA GCE Biology Unit 2 BIOL2 The variety of living organisms 3.2.1 Living organisms vary and this variation is influenced by genetic and environmental factors Causes of variation 3.2.2

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, Networks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Organizing Life s Diversity

Organizing Life s Diversity 17 Organizing Life s Diversity section 2 Modern Classification Classification systems have changed over time as information has increased. What You ll Learn species concepts methods to reveal phylogeny

More information

Map of AP-Aligned Bio-Rad Kits with Learning Objectives

Map of AP-Aligned Bio-Rad Kits with Learning Objectives Map of AP-Aligned Bio-Rad Kits with Learning Objectives Cover more than one AP Biology Big Idea with these AP-aligned Bio-Rad kits. Big Idea 1 Big Idea 2 Big Idea 3 Big Idea 4 ThINQ! pglo Transformation

More information

An Introduction to Bioinformatics Algorithms Hidden Markov Models

An Introduction to Bioinformatics Algorithms   Hidden Markov Models Hidden Markov Models Outline 1. CG-Islands 2. The Fair Bet Casino 3. Hidden Markov Model 4. Decoding Algorithm 5. Forward-Backward Algorithm 6. Profile HMMs 7. HMM Parameter Estimation 8. Viterbi Training

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S3 (box) Methods Methods Genome weighting The currently available collection of archaeal and bacterial genomes has a highly biased distribution of isolates across taxa. For example,

More information

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA

INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA INTERACTIVE CLUSTERING FOR EXPLORATION OF GENOMIC DATA XIUFENG WAN xw6@cs.msstate.edu Department of Computer Science Box 9637 JOHN A. BOYLE jab@ra.msstate.edu Department of Biochemistry and Molecular Biology

More information

Enduring understanding 1.A: Change in the genetic makeup of a population over time is evolution.

Enduring understanding 1.A: Change in the genetic makeup of a population over time is evolution. The AP Biology course is designed to enable you to develop advanced inquiry and reasoning skills, such as designing a plan for collecting data, analyzing data, applying mathematical routines, and connecting

More information

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr Introduction to Bioinformatics Shifra Ben-Dor Irit Orr Lecture Outline: Technical Course Items Introduction to Bioinformatics Introduction to Databases This week and next week What is bioinformatics? A

More information

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics:

Homework Assignment, Evolutionary Systems Biology, Spring Homework Part I: Phylogenetics: Homework Assignment, Evolutionary Systems Biology, Spring 2009. Homework Part I: Phylogenetics: Introduction. The objective of this assignment is to understand the basics of phylogenetic relationships

More information

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18

Outline. Genome Evolution. Genome. Genome Architecture. Constraints on Genome Evolution. New Evolutionary Synthesis 11/1/18 Genome Evolution Outline 1. What: Patterns of Genome Evolution Carol Eunmi Lee Evolution 410 University of Wisconsin 2. Why? Evolution of Genome Complexity and the interaction between Natural Selection

More information

Quantitative Bioinformatics

Quantitative Bioinformatics Chapter 9 Class Notes Signals in DNA 9.1. The Biological Problem: since proteins cannot read, how do they recognize nucleotides such as A, C, G, T? Although only approximate, proteins actually recognize

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Chapter 26 Phylogeny and the Tree of Life

Chapter 26 Phylogeny and the Tree of Life Chapter 26 Phylogeny and the Tree of Life Chapter focus Shifting from the process of how evolution works to the pattern evolution produces over time. Phylogeny Phylon = tribe, geny = genesis or origin

More information

Haploid & diploid recombination and their evolutionary impact

Haploid & diploid recombination and their evolutionary impact Haploid & diploid recombination and their evolutionary impact W. Garrett Mitchener College of Charleston Mathematics Department MitchenerG@cofc.edu http://mitchenerg.people.cofc.edu Introduction The basis

More information

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution Massachusetts Institute of Technology 6.877 Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution 1. Rates of amino acid replacement The initial motivation for the neutral

More information

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007 -2 Transcript Alignment Assembly and Automated Gene Structure Improvements Using PASA-2 Mathangi Thiagarajan mathangi@jcvi.org Rice Genome Annotation Workshop May 23rd, 2007 About PASA PASA is an open

More information

Grade 11 Biology SBI3U 12

Grade 11 Biology SBI3U 12 Grade 11 Biology SBI3U 12 } We ve looked at Darwin, selection, and evidence for evolution } We can t consider evolution without looking at another branch of biology: } Genetics } Around the same time Darwin

More information

Frequently Asked Questions (FAQs)

Frequently Asked Questions (FAQs) Frequently Asked Questions (FAQs) Q1. What is meant by Satellite and Repetitive DNA? Ans: Satellite and repetitive DNA generally refers to DNA whose base sequence is repeated many times throughout the

More information

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES

USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES USING BLAST TO IDENTIFY PROTEINS THAT ARE EVOLUTIONARILY RELATED ACROSS SPECIES HOW CAN BIOINFORMATICS BE USED AS A TOOL TO DETERMINE EVOLUTIONARY RELATIONSHPS AND TO BETTER UNDERSTAND PROTEIN HERITAGE?

More information

Copyright 2000 N. AYDIN. All rights reserved. 1

Copyright 2000 N. AYDIN. All rights reserved. 1 Introduction to Bioinformatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr Multiple Sequence Alignment Outline Multiple sequence alignment introduction to msa methods of msa progressive global alignment

More information

Big Idea 1: The process of evolution drives the diversity and unity of life.

Big Idea 1: The process of evolution drives the diversity and unity of life. Big Idea 1: The process of evolution drives the diversity and unity of life. understanding 1.A: Change in the genetic makeup of a population over time is evolution. 1.A.1: Natural selection is a major

More information

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??

More information

Big Idea 3: Living systems store, retrieve, transmit, and respond to information essential to life processes.

Big Idea 3: Living systems store, retrieve, transmit, and respond to information essential to life processes. Big Idea 3: Living systems store, retrieve, transmit, and respond to information essential to life processes. Enduring understanding 3.A: Heritable information provides for continuity of life. Essential

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson

Grundlagen der Bioinformatik Summer semester Lecturer: Prof. Daniel Huson Grundlagen der Bioinformatik, SS 10, D. Huson, April 12, 2010 1 1 Introduction Grundlagen der Bioinformatik Summer semester 2010 Lecturer: Prof. Daniel Huson Office hours: Thursdays 17-18h (Sand 14, C310a)

More information