Information Theoretic Distance Measures in Phylogenomics

Pavol Hanus, Janis Dingel, Juergen Zech, Joachim Hagenauer and Jakob C. Mueller

Institute for Communications Engineering, Technical University, 829 Munich, Germany
Correspondence to: Institute of Medical Statistics and Epidemiology, Technical University, Ismaninger Str. 22, 81675 Munich, Germany. Email: juergen.zech@tum.de
Department of Behavioural Ecology and Evolutionary Genetics, Max Planck Institute for Ornithology, 8235 Starnberg, Germany

Abstract

A variety of distance measures has been developed in information theory and has proven useful in digital information systems. Because the information of a living organism is stored digitally on the information carrier DNA, it seems natural to apply these methods to genome analysis. We present two applications to genetics. First, a compression based distance measure can be used to compute pairwise distances between genomic sequences of unequal lengths and thus recognize the content of a DNA region. Second, the Kullback-Leibler distance serves as the basis for estimating evolutionary conservation across the genomes of different species in order to identify regions of potentially important functionality. Moreover, we show that conclusions can be drawn about the biological properties of the sequences analyzed in this way.

I. INTRODUCTION

DNA is the primary carrier of an organism's genetic information. It corresponds to a digital signal, since it stores information in the form of a sequence of nucleotides from a quaternary alphabet (A, C, G, T). Recent advances in sequencing technology have led to a steady growth of genomic sequence data, which makes it reasonable to apply methods from information theory to investigate and model how genetic information is stored, processed and transmitted.

Genomic DNA can be subdivided into three main types based on functionality.
The first group comprises regions that code for a functional protein, classically defined as genic or gene regions. Nowadays, the term gene is defined more broadly and also includes coding regions with alternative splice variants and RNA genes that express RNAs of different functions without subsequent translation [1]. Furthermore, in eukaryotic organisms protein coding regions (exons) are subdivided by the non-coding parts of genes (introns). The second group comprises regions that regulate the activity of genes. The so-called cis-regulatory elements (i.e. promoters, terminators, activators, repressors, enhancers and silencers) synchronize the start and termination of transcription of DNA into RNA and regulate the rates of transcription. These regulators are often localized in the vicinity of genes. The third group of DNA sequences includes all regions that are currently not associated with any function or regulation of genes. Besides some effects on the stability of chromosomal structures or within recombination processes, little is known about their function.

After the human genome was sequenced [2], the next challenge was to localize functional regions on the genome. Common strategies for finding protein coding genes are based on identifying parts of products (mRNA or protein) or conserved regions on genomic DNA. Although the annotation of protein coding genes has made good progress, it is still not complete [3]. Analysis of the similarities between various homologous genes has shown that regions conserved between different species (CRs) are often correlated with certain functions. Finding a high number of conserved non-genic regions (CNGs) provides growing evidence for not yet localized functional regions within the non-genic parts of the genome [4].

In this paper we present two methods for genomic analysis using information theoretic measures. In Section II, compression distance measures based on mutual information are used to distinguish between different sequence types.
In Section III, we present a method based on the Kullback-Leibler distance that predicts potentially functional regions in the DNA. Based on the results obtained with this method, we discuss different classes of CNGs from a biological point of view in Section IV. Section V concludes this article.

II. DNA CLASSIFICATION USING COMPRESSION DISTANCE MEASURES BASED ON MUTUAL INFORMATION

The idea of using compression for phylogenetic classification of whole genomes was first introduced in 2001 [5]. It is based on the expectation that a concatenation of two similar sequences compresses better than a concatenation of two dissimilar ones. In [6], compression distance measures based on mutual information were presented. The compared sequences are assumed to have been generated by different information sources, and mutual information is used as a similarity measure of the compared sources.

Mutual information is a measure of the information common to both sources $S_i$, $S_j$. It can be transformed into a bounded distance through normalization by the maximum possible mutual information the two sources can share, resulting in

$$ d(S_i,S_j) = 1 - \frac{I(S_i;S_j)}{\min(H(S_i),H(S_j))} = \frac{\min(H(S_i|S_j),H(S_j|S_i))}{\min(H(S_i),H(S_j))}. \tag{1} $$

According to Shannon's fundamental theorem on data compression, the entropy rate $H(S)$ can be approximated by the compression ratio achieved for the message $s$ generated by the source $S$:

$$ H(S) \approx \frac{|\mathrm{comp}(s)|}{|s|}, \tag{2} $$

where $|\cdot|$ denotes the size in bits or symbols. In the following, universal lossless compression algorithms will be used. Such compressors gradually adjust their underlying general statistical model, which describes a whole class of sources, to the individual statistics of the particular message being compressed. For example, DNACompress [7] encodes the long approximate palindromes and repeats that often occur in genomic DNA. Thus, it is particularly suited to comparing sources of type DNA, since it compresses a concatenation of similar DNA sequences well, as opposed to a concatenation of dissimilar ones. Consequently, the conditional entropy $H(S_i|S_j)$ of two different sources $S_i$ and $S_j$ is approximated by the compression ratio achieved for the message $s_i$ when the compressor's model is trained on the message $s_j$. The compressed size of the concatenated sequences, $|\mathrm{comp}(s_j,s_i)|$, can be used for this purpose:

$$ H(S_i|S_j) \approx \frac{|\mathrm{comp}(s_j,s_i)| - |\mathrm{comp}(s_j)|}{|s_i|}. \tag{3} $$

Since the distance measure (1) turned out to be robust against inaccuracies of the compression based approximation, it is particularly suited for content recognition, where an unknown sequence is to be assigned to the closest sequence from a set of known sequences. To demonstrate the content recognition performance, we present results for the recognition of non-genic regions (ng), exons (ex) and introns (in). As content sequences, the first 5kb of concatenated sequences of each type were taken from human chromosome 19 (c19). Sequences of different sizes of each type, taken from the beginning of chromosome 1 (c1), were used as unknown sequences.
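The approximations (2) and (3) and the distance (1) can be sketched in a few lines. The following is a minimal illustration only, assuming the general-purpose `zlib` compressor as a stand-in for DNACompress; the helper names (`comp_size`, `recognize`, `random_dna`, `mutate`) and the synthetic sequences are hypothetical and not part of the original work.

```python
import random
import zlib


def comp_size(s):
    """Compressed size of s in bits (zlib is only a stand-in for DNACompress)."""
    return 8 * len(zlib.compress(s.encode("ascii"), 9))


def entropy(s):
    """Eq. (2): H(S) is approximated by |comp(s)| / |s|."""
    return comp_size(s) / len(s)


def cond_entropy(s_i, s_j):
    """Eq. (3): H(S_i|S_j) ~ (|comp(s_j, s_i)| - |comp(s_j)|) / |s_i|."""
    return (comp_size(s_j + s_i) - comp_size(s_j)) / len(s_i)


def distance(s_i, s_j):
    """Eq. (1): smaller conditional entropy over smaller entropy."""
    numerator = min(cond_entropy(s_i, s_j), cond_entropy(s_j, s_i))
    return numerator / min(entropy(s_i), entropy(s_j))


def recognize(unknown, known):
    """Return the label of the known sequence closest to the unknown one."""
    return min(known, key=lambda label: distance(unknown, known[label]))


def random_dna(n, rng):
    """Hypothetical test data: n bases drawn uniformly at random."""
    return "".join(rng.choice("ACGT") for _ in range(n))


def mutate(s, rate, rng):
    """Substitute each base with probability `rate` by a different base."""
    return "".join(
        rng.choice([b for b in "ACGT" if b != c]) if rng.random() < rate else c
        for c in s
    )


if __name__ == "__main__":
    rng = random.Random(0)
    known = {"type_a": random_dna(2000, rng), "type_b": random_dna(2000, rng)}
    unknown = mutate(known["type_a"], 0.02, rng)
    print(recognize(unknown, known))
```

Because `zlib` uses a sliding window of 32 KB, this sketch is only meaningful for short sequences; a DNA-specific compressor is needed at the sequence lengths discussed in the text.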
Using DNACompress, all unknown sequences were recognized correctly, as shown in Table I, which lists the pairwise distances $d(S_j^C, S_i^U)$ between all known and unknown sequences. Some distances are greater than 1 due to the concatenation in the compression based approximation of the conditional entropy (3), which leads to high compression ratios if a dissimilar sequence is used for training. Similarly, compression based distance measures based on mutual information can be used to build phylogenetic trees of different species or of the human mitochondrial population.

TABLE I
CONTENT RECOGNITION (NG-EX-IN). BEST SIGNALS MAXIMAL COMPRESSION RATE.

S_j^U \ S_i^C   c19ng-5kb   c19in-5kb   c19ex-5kb
c1ng-3kb        .4-best     .84         1.2
c1ng-13kb       .65-best    1.1         1.1
c1in-3kb        .93         .58-best    1.1
c1in-13kb       1.          .5-best     1.7
c1ex-3kb        1.2         1.1         .96-best
c1ex-13kb       .98         .94         .83-best

III. PREDICTION OF FUNCTIONAL ELEMENTS IN NON-CODING DNA REGIONS

A. Comparative Genomics

An important task in biology is to find putative functional conserved elements. Evolution of vertebrate organisms is a multiple step process of concatenated germ cell maturation and fusion steps. During germ cell maturation (meiosis), the genomic information must be copied and passed on to the daughter cells. Each germ cell carries one copy of the whole genome. Early approaches to model this process were limited to the use of information from one species. Today, with whole genomes of multiple species available, a comparative approach is often used to infer functional regions.

During the process of evolution, the passed genetic information (DNA) is subjected to mutations that cause variations. Mutations that alter information coding for important functionality (e.g. genes) are likely to diminish the organism's capability to pass on its DNA. Thus, those elements within the genome that carry information for important basic functions are less likely to mutate successfully during evolution.
Consequently, by identifying conserved elements in the assembly of the genomes of several species, we find candidates that are very likely to be functional [4]. We will introduce a detection method which, in contrast to earlier approaches [8], [9], is independent of assumptions about neutral evolutionary rates and does not require a priori tuning parameters. We propose a definition of conservation that relies on the Kullback-Leibler distance to the well defined maximum possible conservation, which does not allow any mutations to occur [10].

B. Multiple Sequence Alignment

Fig. 1. A short sample of a Multiple Sequence Alignment of five species (human, rat, mouse, chicken, fugu) and their phylogenetic tree.

Single point mutations are not the only type of mutation changing the DNA. During evolution, large scale mutations can occur. Comparing different species, one observes that severe rearrangements, large scale deletions, insertions and duplications have reshaped their genomes. As a result, when considering distant species, only parts of the DNA can be regarded as having evolved from a common ancestor. Hence, as a preprocessing step for comparative genomic analysis, a computational procedure has to find these sequences and align them properly. This step is commonly referred to as Multiple Sequence Alignment (MSA) and is a vibrant research topic in bioinformatics [11]. In Figure 1, a multiple sequence alignment is visualized together with the phylogenetic tree it is based on. Mutations are marked by white spots, and gaps account for insertions and deletions. After the sequences have been aligned, the description of evolution can concentrate on small substitutions, insertions and deletions.

C. Modeling of Evolution

Mathematically, evolution is commonly described by a set of parameters $\psi_\rho = \{\tau_\rho, t_\rho, R_\rho, \theta_\rho\}$. Models of evolution are thoroughly discussed in [9]. As different sections of the ancestral genome evolved differently, the evolutionary process is not homogeneous, and different sections $\rho$ are described by different parameters. Theoretically, for each position $i$ in the alignment, a different $\psi_i$ should be assumed.

The phylogenetic relation among $n$ species is modeled by a binary tree with topology $\tau_\rho$, having $n$ leaves. The nodes of the tree are connected by branches, and the relative evolutionary distances between the nodes are described by $t_\rho$, where $t_\rho^{u,v}$ denotes the distance between nodes $u$ and $v$ in section $\rho$. Different sections of the genome experience different mutation rates. The parameter $\theta_\rho$ accounts for this rate heterogeneity and is modeled as a scalar that multiplies the relative distance $t_\rho$. Single mutations that occur during the transmission between two nodes of the tree are commonly modeled by a continuous time Markov process that is described by a rate matrix $R_\rho$.
The transition probability matrix between two nodes $u$ and $v$, giving the probabilities that a base is substituted by another, is then calculated by [9]

$$ P_\rho^{u,v} = e^{\theta_\rho t_\rho^{u,v} R_\rho}. \tag{4} $$

Figure 2 shows a toy example of an evolutionary model describing the relation among 3 different species $a$, $b$, $c$ for two different regions $\rho_1$ and $\rho_2$ of their genome.

Fig. 2. Example for the parametrization of evolution in two different sections of an alignment of 3 species: (a) Section $\rho_1$, (b) Section $\rho_2$.

We see that $\tau_{\rho_1} = \tau_{\rho_2}$ and $t_{\rho_1} = t_{\rho_2}$. Equal transition rates are assumed and expressed by $R_{\rho_1} = R_{\rho_2}$. Section $\rho_1$ seems to be under high selective pressure, as only a small number of mutations can be observed, which is modeled by $\theta_{\rho_1} = 0.5$. Section $\rho_2$ shows a higher mutation rate, and $\theta_{\rho_2} = 1$. This results in different transition probability matrices according to (4). Usually, the phylogenetic tree is assumed to be constant for all sections, $\tau_\rho = \tau\ \forall \rho$, as are the relative branch lengths, $t_\rho = t\ \forall \rho$. The rate matrix is often assumed to be constant for large sections of the multiple alignment, whereas the rate heterogeneity is often modeled by a stochastic process in which the $\theta_i$ are sampled from a gamma distribution for each position $i$ in the alignment [9].

D. A Distance Based Conservation Estimator

In a transmission model for evolution, which has parallels to a mobile communication link, we have the following situation: a single sequence $\{x_i\}$ (from a common ancestor) is transmitted over a multipath channel (evolution). Errors (mutations, erasures and insertions) may occur during transmission. At the receiver, we observe the receive vector sequence $\{y_i\}$ (the realizations of the ancestral sequence as we observe it today in the genomes of the species). The channel is characterized by the transition probabilities $p_y(y_i|x_i;\psi_i)$, conditional on $x_i$ and parameterized by $\psi_i$.
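To make Eq. (4) and the channel view concrete, the sketch below computes transition matrices for the two rate scalings of the toy example and evaluates a channel transition probability for one alignment column. It rests on several assumptions that are not taken from the text: the rate matrix uses equal rates between all bases with rows summing to zero, the branch length 0.1 is hypothetical, and the species are connected to the ancestor by a star tree so that the column probability factorizes. A general tree would require Felsenstein's algorithm instead.

```python
import math

BASES = "ACGT"


def mat_mul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def mat_exp(m, terms=12, squarings=10):
    """Matrix exponential: Taylor series of m / 2**squarings, then squaring."""
    n = len(m)
    scaled = [[x / 2 ** squarings for x in row] for row in m]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        term = [[x / k for x in row] for row in mat_mul(term, scaled)]
        result = [[r + t for r, t in zip(res_row, term_row)]
                  for res_row, term_row in zip(result, term)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def transition_matrix(rate_matrix, theta, t):
    """Eq. (4): P = exp(theta * t * R)."""
    return mat_exp([[theta * t * r for r in row] for row in rate_matrix])


def column_given_ancestor(x, column, matrices):
    """p(y | x) under the assumed star tree: one matrix per observed species."""
    prob = 1.0
    for base, p in zip(column, matrices):
        prob *= p[BASES.index(x)][BASES.index(base)]
    return prob


# Assumed rate matrix: equal rates between all bases, rows summing to zero.
R = [[-3.0 if i == j else 1.0 for j in range(4)] for i in range(4)]

# theta = 0.5 and theta = 1 follow the toy example; t = 0.1 is hypothetical.
P_slow = transition_matrix(R, theta=0.5, t=0.1)
P_fast = transition_matrix(R, theta=1.0, t=0.1)

if __name__ == "__main__":
    # Probability of observing the column (A, A, C) given ancestral base A.
    print(column_given_ancestor("A", "AAC", [P_slow] * 3))
    print(column_given_ancestor("A", "AAC", [P_fast] * 3))
```

For this particular rate matrix a closed form exists, so the numerical exponential can be checked against $\tfrac{1}{4} + \tfrac{3}{4} e^{-4\theta t}$ on the diagonal.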
The transition probabilities can be calculated efficiently using Felsenstein's algorithm [9]. A column in the alignment is distributed according to

$$ p_y(y_i;\psi_i) = \sum_{x_i} p_y(y_i|x_i;\psi_i)\, p(x_i), $$

where $p(x_i)$, the distribution of bases in the ancestral genome, is assumed to be known [9].

From this point of view, estimating the conservation of a particular DNA region amounts to estimating how good the transmission channel was in this region. Instead of comparing the observation to a model of neutral evolution, we measure the distance to the maximum conservation. In a communications framework, the maximum conservation is equivalent to the case of noiseless transmission, i.e. the base $x_i$ is observed unchanged in all components of the receive vector $y_i$. In this situation, let the receive vector $y_i$ be distributed according to $p_y(y_i;\psi_0)$. In terms of biology, the natural pressure on a maximally conserved sequence is so high that not a single mutation is allowed to occur.

For the comparison with the maximum conservation case, we estimate the evolutionary model that maximizes the likelihood of an ensemble of received vectors. In a sliding window over the observed data (i.e. sequences of alignable DNA regions) $Y_i = [y_{i-\delta}, \dots, y_{i+\delta}]$, $\delta$ fixed, we determine the evolutionary model $\hat{\psi}_i$ that most likely led to the observed data. Assuming statistical independence among the columns of $Y_i$:

$$ \hat{\psi}_i = \arg\max_{\psi_i} \sum_{j=i-\delta}^{i+\delta} \log p_y(y_j;\psi_i). \tag{5} $$

We calculate the probability mass function $p_y(y_i;\hat{\psi}_i)$ for the column $i$ in the middle of the sliding window. This distribution is parameterized by $\hat{\psi}_i$, and we compare the estimated distribution with the one corresponding to the maximum conservation process using the Kullback-Leibler distance

$$ D\big(p_y(y_i;\hat{\psi}_i)\,\|\,p_y(y_i;\psi_0)\big). \tag{6} $$

In order to obtain a score value in the range of 0 to 1, where 1 indicates maximum conservation, we apply a sigmoid function, as used in neural networks, to transform the distance into the final conservation score (CS):

$$ c_i = 1 - \tanh\Big(D\big(p_y(y_i;\hat{\psi}_i)\,\|\,p_y(y_i;\psi_0)\big)\Big). $$

The score $c_i$ is assigned to the column in the middle of the sliding window. The treatment of gaps in the likelihood function is a general problem in phylogenetics. As in earlier methods, our approach treats gaps as missing data, causing the algorithm to consider only the subtree of species for which data is available.

E. Results

Fig. 3. Top: Comparison of scores indicating CRs (vertical axis: conservation score (CS); annotation: highly conserved). Bottom: Visualization of the respective genomic data, a small section of an alignment of the genomes of human, mouse, rat, chicken and fugu.

Fig. 3 shows our estimate of conservation and the underlying genomic data. Our distance based score signal reflects the different degrees of conservation, as one can observe by comparing the course of the signal with the data. Our method was tested on synthetic data and compared to an established tool from Siepel and Haussler [9]. The results suggested that our method can discriminate more efficiently between the different degrees of conservation [10].

IV. CONSERVATION SCORE AND BIOLOGICAL PROPERTIES

In this section we cluster and analyse CRs longer than 4 bp identified with our method from a five species alignment [4]. A nucleotide with a conservation score (CS) smaller than the exon average value of 0.8 is defined as conserved. A maximum gap of 2 non-conserved nucleotides is tolerated and does not interrupt a CR.
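The mapping from a Kullback-Leibler distance to the conservation score, and the merging of conserved nucleotides into CRs with a tolerated gap, can be sketched as follows. This is an illustration under assumptions rather than the implementation used for the results: the function names are hypothetical, the distributions in the usage example are arbitrary, and the per-nucleotide decision "conserved or not" is passed in as a list of flags so that the thresholding rule stays with the caller.

```python
import math


def kl_distance(p, q):
    """Kullback-Leibler distance D(p || q) in nats for discrete distributions.

    q must be nonzero wherever p is nonzero.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def conservation_score(distance):
    """c = 1 - tanh(D): D = 0 gives 1, large D approaches 0."""
    return 1.0 - math.tanh(distance)


def call_regions(conserved, max_gap=2, min_len=0):
    """Merge conserved positions into regions.

    conserved: one boolean flag per nucleotide.
    max_gap:   runs of up to max_gap non-conserved positions do not
               interrupt a region.
    min_len:   only regions strictly longer than min_len are reported.
    Returns half-open (start, end) index pairs.
    """
    regions = []
    start = last = None
    for i, flag in enumerate(conserved):
        if not flag:
            continue
        if start is None:
            start = last = i
        elif i - last - 1 <= max_gap:
            last = i
        else:
            regions.append((start, last + 1))
            start = last = i
    if start is not None:
        regions.append((start, last + 1))
    return [(s, e) for s, e in regions if e - s > min_len]


if __name__ == "__main__":
    d = kl_distance([0.9, 0.1], [0.5, 0.5])
    print(d, conservation_score(d))
    flags = [True, True, False, False, True, False, False, False, True, True]
    print(call_regions(flags))
```

Keeping the gap tolerance and the minimum length as parameters makes it easy to explore how sensitive the resulting set of CRs is to these two choices.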
Since we are mostly interested in still unknown functional modules, CRs with any known function are excluded from the analysis. Therefore, our total set of CNGs was first screened with RepeatMasker, a program that searches for interspersed repeats and low complexity DNA sequences and masks nearly one half of the human genome [12]. All CRs localized within regions annotated as known genes were removed [13].

To find relations between the CS of CNGs and putative functional modules, we must first show that the CS is associated with biological properties in general. CRs without any open reading frame (ORF) have the property of producing no proteins. Therefore, 3423 CNGs lacking an ORF and 12688 CNGs with an ORF were selected from our total of 16111 CNGs. The distributions of the empirical mean CS and its standard deviation are compared for both groups in Fig. 4. The group of CNGs without any ORF is assumed to hold many cis-regulatory modules with short, interspersed but highly conserved protein binding sites. These characteristics are reflected in the increased CS and in a more heterogeneous distribution of sub-CRs within the CNG (i.e. an increased standard deviation of the CS) compared with CNGs with at least one ORF. The latter group of CNGs may contain CRs of undiscovered pseudogenes or antisense RNA genes descending from former genes. Therefore, one would expect relatively large and homogeneous conserved domains.

These findings indicate that CNG subgroups with different biological properties can be discriminated by their CS distribution. Since clustering of candidate regions is a crucial step in identifying and localizing functional regions, this is an interesting observation. To date, the biological annotation of sequenced genomes is neither complete nor properly standardized. The best gene-predicting programs are able to detect only 85% of all known exons [3].
Annotation of other functional or nonfunctional elements of the genome, such as pseudogenes or non-translated RNA genes, is even worse. Little is also known about the exact localization of cis-regulatory elements [14]. Most publications searching for novel biological properties of DNA sequences are based on annotations or on statistical properties such as nucleotide frequency, known protein binding sites or sequence similarity. Information theoretic distance measures may be a useful new tool in this field. The key to using them properly, however, is to train them with sequences of known biological properties and to learn about the implications of the resulting patterns. Further properties that should be examined are the capability of DNA to form secondary structures and putative sites of chemical modification.

Fig. 4. Distribution of the mean CS and its standard deviation for CNGs with ORF (12688, 78.75%) and CNGs without ORF (3423, 21.25%).

V. CONCLUSION AND PERSPECTIVES

We presented a distance measure based on a trained compression algorithm that is able to classify the type (non-genic, introns, exons) of an unknown DNA sequence. Moreover, we developed a sensitive conservation score to estimate the general conservation of a certain DNA region without assuming neutral evolution rates. This may become a valuable tool for biologists to identify new functional regions. Additionally, this score may be very useful in identifying sequence modules that influence the conservation of other DNA regions. This could be the first step in finding evidence for an error correcting mechanism based on genomic DNA codes, as postulated by Battail [15]. A first experimental hint for this hypothesis was given by a recently published experiment on a reverse mutation ability in the plant Arabidopsis [16]. This plant is able to recover information that is not present in the exon of its parental genome but elsewhere in the genome of the previous generation. Another interesting topic of modern biology is the identification of regulatory non-coding RNAs [17]. One important characteristic of such ncRNAs is conserved secondary structure motifs, which could possibly be discriminated from common primary conserved sequences with our method.

ACKNOWLEDGMENT

This work was supported by the DFG projects no. HA 1358/1-1 and MU 1479/1-1 and the Bund der Freunde der TUM.

REFERENCES

[1] H. Pearson, "Genetics: what is a gene?" Nature, vol. 441, pp. 398-401, 2006.
[2] International Human Genome Sequencing Consortium, "Finishing the euchromatic sequence of the human genome," Nature, vol. 431, pp. 931-945, 2004.
[3] M. Brent, "Genome annotation past, present, and future: how to define an ORF at each locus," Genome Research, vol. 15, pp. 1777-1786, 2005.
[4] A. Siepel, G. Bejerano, and J. S. Pedersen, "Evolutionarily conserved elements in vertebrate, insect, worm, and yeast genomes," Genome Res., vol. 15, no. 8, pp. 1034-1050, August 2005.
[5] M. Li, J. Badger, X. Chen, et al., "An information-based sequence distance and its application to whole mitochondrial genome phylogeny," Bioinformatics, vol. 17, pp. 149-154, 2001.
[6] Z. Dawy, J. Hagenauer, P. Hanus, et al., "Mutual information based distance measures for classification and content recognition with applications to genetics," Proc. of the ICC, 2005.
[7] X. Chen, M. Li, B. Ma, et al., "DNACompress: fast and effective DNA sequence compression," Bioinformatics, vol. 18, no. 12, pp. 1696-1698, 2002.
[8] E. Margulies, M. Blanchette, D. Haussler, et al., "Identification and characterization of multi-species conserved sequences," Genome Research, vol. 13, pp. 2507-2518, 2003.
[9] R. Nielsen, Statistical methods in molecular evolution, Springer Science+Business Media, Inc., pp. 325-351, 2005.
[10] P. Hanus, J. Dingel, J. Hagenauer, et al., "An alternative method for detecting conserved regions in multiple species," Proc. of the German Conference on Bioinformatics, p. 64, 2005.
[11] R. Durbin, S. Eddy, A. Krogh, et al., Biological sequence analysis - probabilistic models of proteins and nucleic acids, Cambridge University Press, 1998.
[12] A. F. A. Smit, R. Hubley, and P. Green, RepeatMasker Open-3.0, http://www.repeatmasker.org, 1996-2004.
[13] E. Birney, D. Andrews, M. Caccamo, et al., "Ensembl 2006," Nucleic Acids Res., January 2006.
[14] M. Blanchette, A. R. Bataille, X. Chen, et al., "Genome-wide computational prediction of transcriptional regulatory modules reveals new insights into human gene expression," Genome Research, vol. 16, pp. 656-668, 2006.
[15] G. Battail, "Information theory and error correcting codes in genetics and biological evolution," Introduction to Biosemiotics, Springer, November 2006.
[16] S. J. Lolle, J. L. Victor, J. M. Young, et al., "Genome-wide non-mendelian inheritance of extra-genomic information in Arabidopsis," Nature, vol. 434, pp. 505-509, March 2005.
[17] B.-J. Yoon and P. Vaidyanathan, "Computational identification and analysis of noncoding RNAs," IEEE Signal Processing Magazine, vol. 24, no. 1, pp. 64-74, January 2007.