Information Theoretic Distance Measures in Phylogenomics
Pavol Hanus, Janis Dingel, Juergen Zech, Joachim Hagenauer and Jakob C. Mueller
Institute for Communications Engineering, Technical University, 829 Munich, Germany
Institute of Medical Statistics and Epidemiology, Technical University, Ismaninger Str. 22, Munich, Germany
Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, 8235 Starnberg, Germany

Abstract

A variety of distance measures have been developed in information theory and have proven useful in applications to digital information systems. Since the information of a living organism is stored digitally on the information carrier DNA, it seems intuitive to apply these methods to genome analysis. We present two applications to genetics. A compression-based distance measure can be used to compute pairwise distances between genomic sequences of unequal lengths and thus to recognize the content of a DNA region. The Kullback-Leibler distance serves as the basis for estimating evolutionary conservation across the genomes of different species in order to identify regions of potentially important functionality. Moreover, we show that conclusions can be drawn about the biological properties of the sequences analyzed in this way.

I. INTRODUCTION

DNA is the primary carrier of an organism's genetic information. It corresponds to a digital signal, since it stores information in the form of a sequence of nucleotides from a quaternary alphabet (A, C, G, T). Recent advances in sequencing technology have led to a steady growth of genomic sequence data, which makes it reasonable to apply methods from information theory to investigate and model how genetic information is stored, processed and transmitted. Genomic DNA can be subdivided into three main types based on functionality. The first group consists of regions that code for a functional protein, classically defined as genic or gene regions.
Nowadays, the term gene is defined more broadly and also includes coding regions with alternative splice variants as well as RNA genes, which express RNAs with different functions without subsequent translation [1]. Furthermore, in eukaryotic organisms protein-coding regions (exons) are separated by the non-coding parts of genes (introns). The second group consists of regions that regulate the activity of genes. The so-called cis-regulatory elements (i.e. promoters, terminators, activators, repressors, enhancers and silencers) synchronize the start and termination of transcription of DNA into RNA and regulate the rates of transcription. These regulators are often located in the vicinity of genes. The third group includes all regions that are currently not associated with any function or regulation of genes. Apart from some effects on the stability of chromosomal structures or on recombination processes, little is known about their function.

After the human genome was sequenced [2], the next challenge was to localize functional regions in the genome. Common strategies for finding protein-coding genes are based on identifying parts of their products (mRNA or protein) or conserved regions in genomic DNA. Although the annotation of protein-coding genes has made good progress, it is still not complete [3]. Analysis of the similarities between various homologous genes has shown that regions conserved between different species (CRs) are often correlated with certain functions. The high number of conserved non-genic regions (CNGs) that have been found provides growing evidence for not yet localized functional regions within the non-genic parts of the genome [4].

In this paper we present two methods for genomic analysis using information theoretic measures. In Section II, compression distance measures based on mutual information are used to distinguish between different sequence types.
In Section III, we present a method based on the Kullback-Leibler distance that predicts potentially functional regions in the DNA. Based on the results obtained with this method, we discuss different classes of CNGs from a biological point of view in Section IV. Section V concludes this article.

II. DNA CLASSIFICATION USING COMPRESSION DISTANCE MEASURES BASED ON MUTUAL INFORMATION

The idea of using compression for phylogenetic classification of whole genomes was first introduced in 2001 [5]. It is based on the expectation that a concatenation of two similar sequences compresses better than a concatenation of two dissimilar ones. In [6], compression distance measures based on mutual information were presented. The compared sequences are assumed to have been generated by different information sources, and mutual information is used as a similarity measure of the compared sources. Mutual information is a measure of the information common to both sources $S_i$, $S_j$. It can be transformed into a bounded distance through normalization by the maximum possible
mutual information the two sources can share, resulting in

$$d(S_i,S_j) = 1 - \frac{I(S_i;S_j)}{\min(H(S_i),H(S_j))} = \frac{\min(H(S_i|S_j),H(S_j|S_i))}{\min(H(S_i),H(S_j))}. \tag{1}$$

According to Shannon's fundamental theorem on data compression, the entropy rate $H(S)$ can be approximated by the compression ratio achieved for the message $s$ generated by the source $S$:

$$H(S) \approx \frac{|\mathrm{comp}(s)|}{|s|}, \tag{2}$$

where $|\cdot|$ denotes the size in bits or symbols. In the following, universal lossless compression algorithms are used. Such compressors gradually adjust their underlying general statistical model, which describes a whole class of sources, to the individual statistics of the particular message being compressed. For example, DNACompress [7] encodes the long approximate palindromes and repeats that often occur in genomic DNA. It is therefore particularly suited to comparing DNA sources, since it compresses a concatenation of similar DNA sequences better than a concatenation of dissimilar ones. Consequently, the conditional entropy $H(S_i|S_j)$ of two different sources $S_i$ and $S_j$ is approximated by the compression ratio achieved for the message $s_i$ when the compressor's model is trained on the message $s_j$. The compressed size of the concatenated sequences, $|\mathrm{comp}(s_j,s_i)|$, can be used for this purpose:

$$H(S_i|S_j) \approx \frac{|\mathrm{comp}(s_j,s_i)| - |\mathrm{comp}(s_j)|}{|s_i|}. \tag{3}$$

Since the distance measure (1) turned out to be robust against inaccuracies of the compression-based approximation, it is particularly suited for content recognition, where an unknown sequence is to be assigned to the closest sequence from a set of known sequences. To demonstrate the content recognition performance, we present results for non-genic regions (ng), exons (ex) and introns (in). As content sequences, the first 5, nucleotides (5kb) of concatenated sequences of each type were taken from human chromosome 19 (c19). Sequences of different sizes of each type, taken from the beginning of chromosome 1 (c1), were used as unknown sequences.
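The approximations (2) and (3) and the distance (1) can be sketched in a few lines. This is only an illustration under stated assumptions: the general-purpose `zlib` compressor stands in for DNACompress, sizes are measured in bytes, and the function names (`comp_size`, `distance`, `recognize`, `random_dna`) and the pseudo-random toy sequences are hypothetical and not taken from the paper.

```python
import random
import zlib


def comp_size(message: bytes) -> int:
    """Size in bytes of the zlib-compressed message (stand-in for |comp(s)|)."""
    return len(zlib.compress(message, 9))


def entropy_rate(s: bytes) -> float:
    """Eq. (2): approximate H(S) by the compression ratio |comp(s)| / |s|."""
    return comp_size(s) / len(s)


def conditional_entropy(s_i: bytes, s_j: bytes) -> float:
    """Eq. (3): approximate H(S_i|S_j) from the concatenation s_j + s_i."""
    return (comp_size(s_j + s_i) - comp_size(s_j)) / len(s_i)


def distance(s_i: bytes, s_j: bytes) -> float:
    """Eq. (1): minimum conditional entropy over minimum entropy."""
    numerator = min(conditional_entropy(s_i, s_j), conditional_entropy(s_j, s_i))
    denominator = min(entropy_rate(s_i), entropy_rate(s_j))
    return numerator / denominator


def recognize(unknown: bytes, known: dict) -> str:
    """Return the label of the known sequence closest to the unknown one."""
    return min(known, key=lambda label: distance(unknown, known[label]))


def random_dna(length: int, seed: int) -> bytes:
    """Hypothetical toy data: a reproducible pseudo-random DNA string."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length)).encode("ascii")


# Toy content sequences and an unknown fragment copied from source "a".
known = {"a": random_dna(4000, seed=1), "b": random_dna(4000, seed=2)}
unknown = known["a"][1000:2000]
closest = recognize(unknown, known)
```

Because the conditional entropy is estimated from a concatenation, the resulting distance is not guaranteed to stay within [0, 1]; a DNA-specific compressor would be expected to separate real sequence types far better than `zlib` does.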
Using DNACompress, all unknown sequences were recognized correctly, as shown in Table I, which lists the pairwise distances $d(S^C_j, S^U_i)$ between all known and unknown sequences. Some distances are greater than 1 because of the concatenation in the compression-based approximation of the conditional entropy (3), which leads to high compression ratios if a dissimilar sequence is used for training. Similarly, compression-based distance measures based on mutual information can be used to build phylogenetic trees of different species or of the human mitochondrial population.

TABLE I. CONTENT RECOGNITION (NG-EX-IN). "BEST" SIGNALS MAXIMAL COMPRESSION RATE.
$S^U_j \backslash S^C_i$: c19ng-5kb, c19in-5kb, c19ex-5kb
c1ng-3kb: .4-best
c1ng-13kb: .65-best
c1in-3kb: best 1.1
c1in-13kb: 1..5-best 1.7
c1ex-3kb: best
c1ex-13kb: best

III. PREDICTION OF FUNCTIONAL ELEMENTS IN NON-CODING DNA REGIONS

A. Comparative Genomics

An important task in biology is to find putative functional conserved elements. The evolution of vertebrate organisms is a multi-step process of concatenated germ cell maturation and fusion steps. During germ cell maturation (meiosis), the genomic information must be copied and passed on to the daughter cells. Each germ cell carries one copy of the whole genome. Early approaches to modeling this process were limited to the use of information from one species. Today, with whole genomes of multiple species available, a comparative approach is often used to infer functional regions. During the process of evolution, the genetic information (DNA) that is passed on is subject to mutations that cause variation. Mutations that alter information coding for important functionality (e.g. genes) are likely to diminish the organism's capability to pass on its DNA. Thus, elements within the genome that carry information for important basic functions are less likely to mutate successfully during evolution.
Consequently, by identifying conserved elements in the assembly of the genomes of several species, we find candidates that are very likely to be functional [4]. We introduce a detection method which, in contrast to earlier approaches [8], [9], is independent of assumptions about neutral evolutionary rates and does not require a priori tuning parameters. We propose a definition of conservation that relies on the Kullback-Leibler distance to the well-defined maximum possible conservation, which does not allow any mutations to occur [10].

B. Multiple Sequence Alignment

Fig. 1. A short sample of a Multiple Sequence Alignment of five species (human, rat, mouse, chicken, fugu) and their phylogenetic tree.

Single point mutations are not the only type of mutation changing the DNA. During evolution, large-scale mutations can occur. Comparing different species, it is observed that severe rearrangements, large-scale deletions, insertions and
duplications have reshaped their genomes. As a result, only parts of the DNA can be regarded as having evolved from a common ancestor when distant species are considered. Hence, as a preprocessing step for comparative genomic analysis, a computational procedure has to find these sequences and align them properly. This step is commonly referred to as Multiple Sequence Alignment (MSA) and is a vibrant research topic in bioinformatics [11]. In Figure 1, a multiple sequence alignment is visualized together with the phylogenetic tree it is based on. Mutations are marked by white spots, and gaps account for insertions and deletions. Once the sequences have been aligned, the description of evolution can concentrate on small substitutions, insertions and deletions.

C. Modeling of Evolution

Mathematically, evolution is commonly described by a set of parameters $\psi_\rho = \{\tau_\rho, t_\rho, R_\rho, \theta_\rho\}$. Models of evolution are thoroughly discussed in [9]. As different sections of the ancestral genome evolved differently, the evolutionary process is not homogeneous, and different sections $\rho$ are described by different parameters. Theoretically, a different $\psi_i$ should be assumed for each position $i$ in the alignment. The phylogenetic relation among $n$ species is modeled by a binary tree with topology $\tau_\rho$ and $n$ leaves. The nodes of the tree are connected by branches, and the relative evolutionary distances between the nodes are described by $t_\rho$, where $t_\rho^{u \to v}$ denotes the distance between nodes $u$ and $v$ in section $\rho$. Different sections of the genome experience different mutation rates. The parameter $\theta_\rho$ accounts for this rate heterogeneity and is modeled as a scalar that multiplies the relative distance $t_\rho$. Single mutations that occur during the transmission between two nodes of the tree are commonly modeled by a continuous-time Markov process described by a rate matrix $R_\rho$.
The transition probability matrix between two nodes $u$ and $v$, which gives the probabilities that a base is substituted by another, is then calculated as [9]

$$P_\rho^{u \to v} = e^{\theta_\rho t_\rho^{u \to v} R_\rho}. \tag{4}$$

Figure 2 shows a toy example of an evolutionary model describing the relation among 3 different species a, b, c for two different regions $\rho_1$ and $\rho_2$ of their genome.

Fig. 2. Example for the parametrization of evolution in two different sections of an alignment of 3 species: (a) Section $\rho_1$, (b) Section $\rho_2$.

We see that $\tau_{\rho_1} = \tau_{\rho_2}$ and $t_{\rho_1} = t_{\rho_2}$. Equal transition rates are assumed, which is expressed by $R_{\rho_1} = R_{\rho_2}$. Section $\rho_1$ seems to be under high selective pressure, as only a small number of mutations can be observed, which is modeled by $\theta_{\rho_1} = 0.5$. Section $\rho_2$ shows a higher mutation rate, and $\theta_{\rho_2} = 1$. This results in different transition probability matrices according to (4). Usually, the phylogenetic tree is assumed to be constant for all sections, $\tau_\rho = \tau\ \forall \rho$, as are the relative branch lengths, $t_\rho = t\ \forall \rho$. The rate matrix is often assumed to be constant for large sections of the multiple alignment, whereas the rate heterogeneity is often modeled by a stochastic process in which the $\theta_i$ are sampled from a gamma distribution for each position $i$ in the alignment [9].

D. A Distance Based Conservation Estimator

In a transmission model for evolution, which has parallels to a mobile communication link, we have the following situation: a single sequence $\{x_i\}$ (from a common ancestor) is transmitted over a multipath channel (evolution). Errors (mutations, erasures and insertions) may occur during transmission. At the receiver, we observe the receive vector sequence $\{y_i\}$ (the realizations of the ancestral sequence as we observe it today in the genomes of the species). The channel is characterized by the transition probabilities $p_y(y_i|x_i;\psi_i)$, conditional on $x_i$ and parameterized by $\psi_i$. These can be calculated efficiently using Felsenstein's algorithm [9].
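The effect of the rate parameter in (4) can be illustrated numerically. The following is a minimal sketch, assuming a Jukes-Cantor rate matrix (equal substitution rates between all bases) and a single hypothetical branch length of 0.2; neither choice comes from the paper, and the helper names are illustrative. The matrix exponential is computed in pure Python by scaling and squaring with a truncated Taylor series, and a closed form for the Jukes-Cantor diagonal entry is included as a cross-check.

```python
import math


def mat_mul(a, b):
    """Product of two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, squarings=10, terms=20):
    """exp(m) via scaling and squaring with a truncated Taylor series."""
    n = len(m)
    small = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes small^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[r + t for r, t in zip(r_row, t_row)]
                  for r_row, t_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def jukes_cantor_rate_matrix(alpha=1.0 / 3.0):
    """Assumed rate matrix R: every off-diagonal rate equals alpha."""
    return [[-3.0 * alpha if i == j else alpha for j in range(4)]
            for i in range(4)]


def transition_matrix(rate_matrix, theta, t):
    """Eq. (4): P = exp(theta * t * R)."""
    return mat_exp([[theta * t * x for x in row] for row in rate_matrix])


def jukes_cantor_stay_probability(theta, t, alpha=1.0 / 3.0):
    """Closed-form diagonal entry of exp(theta * t * R) for Jukes-Cantor."""
    return 0.25 + 0.75 * math.exp(-4.0 * alpha * theta * t)


R = jukes_cantor_rate_matrix()
branch_length = 0.2                                  # hypothetical value
p_slow = transition_matrix(R, theta=0.5, t=branch_length)  # conserved section
p_fast = transition_matrix(R, theta=1.0, t=branch_length)  # faster section
```

With the same tree and rate matrix, the smaller rate parameter yields a larger probability that a base stays unchanged along the branch, which is the behaviour described for the two sections of Fig. 2.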
A column in the alignment is distributed according to

$$p_y(y_i;\psi_i) = \sum_{x_i} p_y(y_i|x_i;\psi_i)\,p(x_i),$$

where $p(x_i)$, the distribution of bases in the ancestral genome, is assumed to be known [9]. From this point of view, estimating the conservation of a particular DNA region amounts to estimating how good the transmission channel was in this region. Instead of comparing the observation to a model of neutral evolution, we measure the distance to the maximum conservation. In a communications framework, the maximum conservation is equivalent to the case of noiseless transmission, i.e. the base $x_i$ is observed unchanged in all components of the receive vector $y_i$. In this situation, let the receive vector $y_i$ be distributed according to $p_y(y_i;\psi)$. In terms of biology, the natural pressure on a maximally conserved sequence is so high that not a single mutation is allowed to occur. For the comparison with the maximum conservation case, we estimate the evolutionary model that maximizes the likelihood of an ensemble of received vectors. In a sliding window over the observed data (i.e. sequences of alignable DNA regions) $Y_i = [y_{i-\delta},\ldots,y_{i+\delta}]$, with $\delta$ fixed, we determine the evolutionary model $\hat\psi_i$ that most likely led to the observed data. Assuming statistical independence among the columns of $Y_i$:

$$\hat\psi_i = \arg\max_{\psi_i} \sum_{j=i-\delta}^{i+\delta} \log\left(p_y(y_j;\psi_i)\right). \tag{5}$$

We calculate the probability mass function $p_y(y_i;\hat\psi_i)$ for the column $i$ in the middle of the sliding window. This distribution
is parameterized by $\hat\psi_i$, and we compare the estimated distribution with the one corresponding to the maximum conservation process using the Kullback-Leibler distance

$$D\left(p_y(y_i;\hat\psi_i)\,\|\,p_y(y_i;\psi)\right). \tag{6}$$

In order to obtain a score value in the range of 0 to 1, where 1 indicates maximum conservation, we apply a sigmoid function, as used in neural networks, to transform the distance into the final conservation score (CS):

$$c_i = 1 - \tanh\left(D\left(p_y(y_i;\hat\psi_i)\,\|\,p_y(y_i;\psi)\right)\right).$$

The score $c_i$ is assigned to the column in the middle of the sliding window. The treatment of gaps in the likelihood function is a general problem in phylogenetics. As in earlier methods, gaps are treated in our approach as missing data, which causes the algorithm to consider only the subtree of species for which data is available.

E. Results

Fig. 3. Top: Comparison of scores indicating CRs (axis: Conservation Score (CS); highly conserved regions marked). Bottom: Visualization of the respective genomic data, a small section of an alignment of the genomes of human, mouse, rat, chicken and fugu.

Fig. 3 shows our estimate of conservation and the underlying genomic data. Our distance-based score signal reflects the different degrees of conservation, as can be observed by comparing the course of the signal with the data. Our method was tested on synthetic data and compared to an established tool by Siepel and Haussler [9]. The results suggested that our method can discriminate more efficiently between the different degrees of conservation [10].

IV. CONSERVATION SCORE AND BIOLOGICAL PROPERTIES

In this section we cluster and analyse CRs longer than 4 bp identified with our method from a five-species alignment [4]. A nucleotide with a conservation score (CS) smaller than the exon average value of .8 is defined as conserved. A maximum gap of 2 non-conserved nucleotides is tolerated and does not interrupt a CR.
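The mapping from the Kullback-Leibler distance (6) to the conservation score of Section III-D can be sketched on toy distributions. Everything specific here is an assumption made for illustration: columns are reduced to two species, the natural logarithm is used, the function names are hypothetical, and the reference distribution is given a small leak mass on mismatched columns so that the divergence stays finite (a strictly noiseless reference has zero mass there, which drives the distance to infinity and the score to 0).

```python
import math
from itertools import product

BASES = "ACGT"


def column_distribution(mismatch_mass: float) -> dict:
    """Toy distribution over two-species alignment columns such as ('A', 'A')."""
    dist = {}
    for x, y in product(BASES, repeat=2):
        if x == y:
            dist[(x, y)] = (1.0 - mismatch_mass) / 4.0   # 4 identical columns
        else:
            dist[(x, y)] = mismatch_mass / 12.0          # 12 mismatched columns
    return dist


def kl_divergence(p: dict, q: dict) -> float:
    """D(p || q) in nats; infinite if p has mass where q has none."""
    total = 0.0
    for outcome, p_x in p.items():
        if p_x == 0.0:
            continue
        q_x = q.get(outcome, 0.0)
        if q_x == 0.0:
            return math.inf
        total += p_x * math.log(p_x / q_x)
    return total


def conservation_score(p_estimated: dict, p_reference: dict) -> float:
    """c = 1 - tanh(D(p_estimated || p_reference)), a value in [0, 1]."""
    return 1.0 - math.tanh(kl_divergence(p_estimated, p_reference))


# Near-noiseless reference; the leak mass 1e-3 is an assumption of this sketch.
reference = column_distribution(1e-3)
score_conserved = conservation_score(column_distribution(0.02), reference)
score_diverged = conservation_score(column_distribution(0.6), reference)
```

A column distribution with few mismatches lies close to the reference and scores near 1, whereas one dominated by mismatches scores near 0.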
We are mostly interested in still unknown functional modules; CRs with any known function are therefore excluded from the analysis. To this end, our full set of CNGs was first screened with RepeatMasker, a program that searches for interspersed repeats and low-complexity DNA sequences and masks nearly one half of the human genome [12]. All CRs located within regions annotated as known genes were removed [13]. To find relations between the CS of CNGs and putative functional modules, we must first show that the CS is associated with biological properties in general. CRs without any open reading frame (ORF) cannot produce proteins. Therefore, 3423 CNGs lacking an ORF and 12688 CNGs with an ORF were selected from our full set of CNGs. The distributions of the empirical mean CS and of its standard deviation are compared for both groups in Fig. 4. The group of CNGs without any ORF is assumed to hold many cis-regulatory modules with short, interspersed but highly conserved protein binding sites. These characteristics are reflected in the increased CS and in a more heterogeneous distribution of sub-CRs within the CNG (i.e. an increased standard deviation of the CS) compared with CNGs with at least one ORF. The latter group of CNGs may contain CRs of undiscovered pseudogenes or antisense RNA genes descending from former genes. One would therefore expect relatively large and homogeneous conserved domains. These findings indicate that CNG subgroups with different biological properties can be discriminated by their CS distribution. Since clustering of candidate regions is a crucial step in identifying and localizing functional regions, this is an interesting observation. To date, the biological annotation of sequenced genomes is neither complete nor properly standardized. The best gene-prediction programs are able to detect only 85% of all known exons [3].
The annotation of other functional or nonfunctional elements of the genome, such as pseudogenes or non-translated RNA genes, is even worse. Little is also known about the exact localization of cis-regulatory elements [14]. Most publications searching for novel biological properties of DNA sequences are based on annotations or on statistical properties such as nucleotide frequency, known protein binding sites or sequence similarity. Information theoretic distance measures may be a useful new tool in this field. The key to using them properly, however, is to train them with sequences of known biological properties and to learn about the implications of the resulting patterns. Further properties that should be examined are the capability of DNA to form secondary structures and putative sites of chemical modification.

Fig. 4. Distribution of the mean CS and of its standard deviation for CNGs with ORF (12688, 78.75%) and CNGs without ORF (3423, 21.25%). Panels: mean of CS, standard deviation of CS.

V. CONCLUSION AND PERSPECTIVES

We presented a distance measure based on a trained compression algorithm that is able to classify the type (non-genic, intron, exon) of an unknown DNA sequence. Moreover, we developed a sensitive conservation score to estimate the general conservation of a certain DNA region without assuming neutral evolution rates. This may become a valuable tool for biologists to identify new functional regions. Additionally, this score may be very useful in identifying sequence modules that influence the conservation of other DNA regions. This could be the first step in finding evidence for an error-correcting mechanism based on genomic DNA codes, as postulated by Battail [15]. A first experimental hint for this hypothesis was given by a recently published experiment on a reverse mutation ability in the plant Arabidopsis [16]. This plant is able to recover information that is not present in the exon of its parental genome but elsewhere in the genome of the previous generation. Another interesting topic of modern biology is the identification of regulatory non-coding RNAs [17]. One important characteristic of such ncRNAs is conserved secondary structure motifs, which could possibly be discriminated from common primary conserved sequences with our method.

ACKNOWLEDGMENT

This work was supported by the DFG projects no. HA 1358/1-1 and MU 1479/1-1 and the Bund der Freunde der TUM.

REFERENCES

[1] H. Pearson, "Genetics: what is a gene?", Nature, vol. 441, 2006.
[2] International Human Genome Sequencing Consortium, "Finishing the euchromatic sequence of the human genome", Nature, vol. 431, 2004.
[3] M. Brent, "Genome annotation past, present, and future: how to define an ORF at each locus", Genome Research, vol. 15, 2005.
[4] A. Siepel, G. Bejerano, and J. S. Pedersen, "Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes", Genome Res., vol. 15, no. 8, August 2005.
[5] M. Li, J. Badger, X. Chen, et al., "An information-based sequence distance and its application to whole mitochondrial genome phylogeny", Bioinformatics, vol. 17, 2001.
[6] Z. Dawy, J. Hagenauer, P. Hanus, et al., "Mutual information based distance measures for classification and content recognition with applications to genetics", Proc. of the ICC, 2005.
[7] X. Chen, M. Li, B. Ma, et al., "DNACompress: fast and effective DNA sequence compression", Bioinformatics, vol. 18, no. 12, 2002.
[8] E. Margulies, M. Blanchette, D. Haussler, et al., "Identification and characterization of multi-species conserved sequences", Genome Research, vol. 13, 2003.
[9] R. Nielsen, Statistical methods in molecular evolution, Springer Science+Business Media, Inc., 2005.
[10] P. Hanus, J. Dingel, J. Hagenauer, et al., "An alternative method for detecting conserved regions in multiple species", Proc. of the German Conference on Bioinformatics, p. 64, 2005.
[11] R. Durbin, S. Eddy, A. Krogh, et al., Biological sequence analysis: probabilistic models of proteins and nucleic acids, Cambridge University Press.
[12] A. F. A. Smit, R. Hubley, and P. Green, RepeatMasker Open.
[13] E. Birney, D. Andrews, M. Caccamo, et al., "Ensembl 2006", Nucleic Acids Res., January 2006.
[14] M. Blanchette, A. R. Bataille, X. Chen, et al., "Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression", Genome Research, vol. 16, 2006.
[15] G. Battail, "Information theory and error correcting codes in genetics and biological evolution", Introduction to Biosemiotics, Springer, November 2006.
[16] S. J. Lolle, J. L. Victor, J. M. Young, et al., "Genome-wide non-mendelian inheritance of extra-genomic information in Arabidopsis", Nature, vol. 434, March 2005.
[17] B.-J. Yoon and P. Vaidyanathan, "Computational identification and analysis of noncoding RNAs", IEEE Signal Processing Magazine, vol. 24, no. 1, January 2007.
More informationMultiple Choice Review- Eukaryotic Gene Expression
Multiple Choice Review- Eukaryotic Gene Expression 1. Which of the following is the Central Dogma of cell biology? a. DNA Nucleic Acid Protein Amino Acid b. Prokaryote Bacteria - Eukaryote c. Atom Molecule
More informationGCD3033:Cell Biology. Transcription
Transcription Transcription: DNA to RNA A) production of complementary strand of DNA B) RNA types C) transcription start/stop signals D) Initiation of eukaryotic gene expression E) transcription factors
More informationComputational Genomics. Systems biology. Putting it together: Data integration using graphical models
02-710 Computational Genomics Systems biology Putting it together: Data integration using graphical models High throughput data So far in this class we discussed several different types of high throughput
More informationBioinformatics Chapter 1. Introduction
Bioinformatics Chapter 1. Introduction Outline! Biological Data in Digital Symbol Sequences! Genomes Diversity, Size, and Structure! Proteins and Proteomes! On the Information Content of Biological Sequences!
More informationSingle alignment: Substitution Matrix. 16 march 2017
Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block