Modeling Noise in Genetic Sequences

Similar documents
EVOLUTIONARY DISTANCES

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Phylogenetic Assumptions

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 27. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/26

Lecture Notes: Markov chains

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 24. Phylogeny methods, part 4 (Models of DNA and protein change) p.1/22

Molecular Evolution and Phylogenetic Tree Reconstruction

Lecture 3: Markov chains.

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

O 3 O 4 O 5. q 3. q 4. Transition

Lecture 4. Models of DNA and protein change. Likelihood methods

The Phylo- HMM approach to problems in comparative genomics, with examples.

the noisy gene Biology of the Universidad Autónoma de Madrid Jan 2008 Juan F. Poyatos Spanish National Biotechnology Centre (CNB)

Phylogenetic Tree Reconstruction

Phylogenetics: Building Phylogenetic Trees

Computational Biology: Basics & Interesting Problems

Computational Cell Biology Lecture 4

Using algebraic geometry for phylogenetic reconstruction

Chapter 2 Class Notes Words and Probability

Reconstruire le passé biologique modèles, méthodes, performances, limites

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

What Is Conservation?

Massachusetts Institute of Technology Computational Evolutionary Biology, Fall, 2005 Notes for November 7: Molecular evolution

Stochastic processes and

Estimating the Rate of Evolution of the Rate of Molecular Evolution

Mutation models I: basic nucleotide sequence mutation models

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Evolutionary Analysis of Viral Genomes

Stochastic Processes

Reading for Lecture 13 Release v10

(Lys), resulting in translation of a polypeptide without the Lys amino acid. resulting in translation of a polypeptide without the Lys amino acid.

Dr. Amira A. AL-Hosary

Markov Models & DNA Sequence Evolution

Bioinformatics 2 - Lecture 4

3/1/17. Content. TWINSCAN model. Example. TWINSCAN algorithm. HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

Inferring Speciation Times under an Episodic Molecular Clock

Evolving Antibiotic Resistance in Bacteria by Controlling Mutation Rate

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

How should we go about modeling this? Model parameters? Time Substitution rate Can we observe time or subst. rate? What can we observe?

Lecture 4. Models of DNA and protein change. Likelihood methods

Stochastic processes and Markov chains (part II)

进化树构建方法的概率方法 第 4 章 : 进化树构建的概率方法 问题介绍. 部分 lid 修改自 i i f l 的 ih l i

7. Tests for selection

EVOLUTIONARY DISTANCE MODEL BASED ON DIFFERENTIAL EQUATION AND MARKOV PROCESS

Hidden Markov models in population genetics and evolutionary biology

Multiple Choice Review- Eukaryotic Gene Expression

Computational statistics

Lie Markov models. Jeremy Sumner. School of Physical Sciences University of Tasmania, Australia

HMM for modeling aligned multiple sequences: phylo-hmm & multivariate HMM

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Phylogenetic Inference using RevBayes

Phylogenetics. BIOL 7711 Computational Bioscience

Maximum Likelihood Until recently the newest method. Popularized by Joseph Felsenstein, Seattle, Washington.

Sequence variability,long-range. dependence and parametric entropy

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

The Phylogenetic Handbook

Evolutionary Models. Evolutionary Models

Stochastic Errors vs. Modeling Errors in Distance Based Phylogenetic Reconstructions

arxiv:q-bio/ v1 [q-bio.pe] 23 Jan 2006

MA6451 PROBABILITY AND RANDOM PROCESSES

Preliminaries. Download PAUP* from: Tuesday, July 19, 16

Lecture Notes: BIOL2007 Molecular Evolution

Applications of hidden Markov models for comparative gene structure prediction

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld

2. Mathematical descriptions. (i) the master equation (ii) Langevin theory. 3. Single cell measurements

Can evolution paths be explained by chance alone? arxiv: v1 [math.pr] 12 Oct The model

Evolutionary Computation

BIOINFORMATICS TRIAL EXAMINATION MASTERS KT-OR

RELATING PHYSICOCHEMMICAL PROPERTIES OF AMINO ACIDS TO VARIABLE NUCLEOTIDE SUBSTITUTION PATTERNS AMONG SITES ZIHENG YANG

A DNA Sequence 2017/12/6 1

Bacterial Genetics & Operons

Reinforcement Learning: Part 3 Evolution

Neutral Theory of Molecular Evolution

Inferring Molecular Phylogeny

Sequence analysis and comparison

Basic modeling approaches for biological systems. Mahesh Bule

Hidden Markov Models in Bioinformatics 14.11

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Module 9: Stationary Processes

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Phylogenetics: Distance Methods. COMP Spring 2015 Luay Nakhleh, Rice University

Supplementary Information for Hurst et al.: Causes of trends of amino acid gain and loss

process on the hierarchical group

Genomes and Their Evolution

Processes of Evolution

Taming the Beast Workshop

Understanding relationship between homologous sequences

Maximum Likelihood in Phylogenetics

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Linear Regression (1/1/17)

Introduction to the SNP/ND concept - Phylogeny on WGS data

Markov Chain Monte Carlo

BMI/CS 776 Lecture 4. Colin Dewey

What can sequences tell us?

Lecture 4: Hidden Markov Models: An Introduction to Dynamic Decision Making. November 11, 2010

Comparative genomics: Overview & Tools + MUMmer algorithm

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Transcription:

Modeling Noise in Genetic Sequences M. Radavičius 1 and T. Rekašius 2 1 Institute of Mathematics and Informatics, Vilnius, Lithuania 2 Vilnius Gediminas Technical University, Vilnius, Lithuania 1. Introduction: nucleotide sequences, problems 2. Models: independent and context-dependent evolution 3. Noninformative nucleotide sequences: problem statement, assumptions of a simple evolution model, definition of genetic noise 4. Computer simulations: long-range dependence, comparison with real nucleotide sequences 5. Conclusions 1

1. Introduction 1.1. Nucleotide sequences 4 nucleotides: A,C,G,T Nucleotide sequence...gcaatacgccta... Coding and non-coding regions. Gene is a protein coding nucleotide sequence, and DNA sequences located between genes are non-coding genome sequences. Evolution: mutation, insertion, deletion. Properties of nucleotide sequences: Bacterial genomes: total (0.5 10) 10 6 nucleotides, genes 10 2 10 3. Human genome: total 3.12 10 7 nucleotides, genes 30000, non-coding part 97%. Correlations: long-range dependence in nucleotide sequences indicates a complexity of a system; Li et al. (1992), Peng et al. (1992), Buldyrev et al. (1995), Karlin et al. (1993). 2

1.2. Problems Practical: prediction and identification of biological function(s) of genes or groups of genes, identification of functionally related genes (regions), (re)construction of phylogenetic trees... Decoding: Searching for informative and biologically important regions, extraction of genetic information (mining) Coding: Modeling and generation of nucleotide sequences, simulation of DNA evolution We are interested in a opposite problem: what are noninformative nucleotide sequences? 3

2. Models 2.1. Notation Let X(t), t T, be a (discrete time) finite homogeneous Markov chain, X(t) = {x l (t) A, l = 1,..., n, t T }, Here T = {0, 1,...}, A is a finite set; for DNA sequence A = {A, C, G, T }. Two directions of evolution: Evolution in time X(t) X(t + 1) In the stationary case distribution of X(t) is independent of t. Thus, it defines probability distribution of a random sequence X on the set of sequences A n and we can consider its Evolution in space x l x l+1 4

2.2. Independent evolution The first stochastic models of DNA evolution assumed that the nucleotide along the DNA sequence evolved independently of one another according to the same rule. Jukes and Cantor (1969), Kimura (1980), HKY model (Hasegawa et al. (1985)) Consequently, X is a sequence of i.i.d. variables i.e. independent and identically distributed nucleotides. It is not realistic: results of some statistical analysis are presented below. A natural generalization: models with independent codons (for coding sequences). Muse and Gaut (1994), Goldman and Yang (1994), Pedersen et al. (1998), Schat and Lange (2002). 5

2.3. Context-dependent evolution In context-dependent evolution model it is assumed that mutations in each site (nucleotide or codon) depend on its nearest neighbours. Usually continuous time Markov chains are considered. Time reversibility implies the Markov property. Arndt et al. (2003) The Markov property in space of the evolution is supposed. Hwang and Green (2004) consider a general nonreversible context-dependent nucleotide model, Siepel and Haussler (2004), Christensen at al. (2004) context dependent codon model. Jensen and Petersen (2000, 2001) and Jensen (2005) a discussion of the relation between time reversibility and the Markov property of the stationary measure of contextdependent evolution is given. 6

3. Noninformative nucleotide sequences Problem: What does it mean noninformative nucleotide sequences? How to define genetic noise, i.e. sequence which has no genetically important information? Direct application: Informativeness of a segment in DNA is measured as its distance to noninformative one usually obtained as a random permutation of the initial segment. Assumptions: 1. Non-coding regions of DNA has not direct impact on survival of biological species and thus is not (so) genetically important. 2. Evolution of non-coding regions has simple structure and are controlled by local factors. For instance, here we ignore insertions and deletions and assume that probability of mutation in any site depends exclusively on its nearest neighbours. 3. The stationary distribution of non-coding sequence evolution can be treated as noninformative, i.e. as genetic noise. 7

Definition: Let the evolution X(t), t T, of nucleotide sequences x A n in time be a (discrete time) homogeneous Markov chain with a given transition probabilities Π of a simple structure. If there exists its stationary distribution p on A n, a random sequence X with the distribution p is called noninformative or genetic noise. Assume for simplicity that the site state set A = {0, 1} and consider the Glauber dynamics in time of a random sequence {X(t), t T } (X(t) A n ) of the length n. Suppose that this dynamics is Markov and homogeneous in both time and space but in each site depends on its nearest neighbours. Namely, 1. a nucleotide from sequence is selected with probability 1/n; 2. the selected nucleotide mutates with probability π uzv := P{x l (t + 1) = z x [l 1,l+1] (t) = uzv} (1) l = 2,..., n 1, u, z, v A, t T. Here z = 1 z, x [l 1,l+1] = x l 1 x l x l+1 (we omit the argument t). 8

Let X denote the noise obtained by this evolution, i.e. X is a random sequence of 0 s and 1 s with the stationary (invariant) distribution of {X(t), t T }. It is completely determined by 8 scalar parameters π := {π uzv }. What can we say about the properties of the genetic noise X? Proposition If X is homogeneous Markov chain (in space) of order k < n/4 then: a) k = 1, X is reversible and depends on 2 parameters, θ and γ, b) the probabilities π are satisfy equalities: θ = π 000 π 010, γ = π 010(π 100 +π 001 ) π 000 (π 110 +π 011 ), θγ 2 = π 101 π 111. The proof s are rather straightforward and are based on Hamersley-Clifford theorem for finite random fields. 9

4. Computer simulations Let A = {0, 1}. The sequence X(t) = {x l (t), l = 1,..., n}, t T, evolves under the context-dependent mutation model. Probability of nucleotide mutation in the sequence depends on two neighbouring nucleotides (the same Glauber dynamic model). Several different sets of transition probabilities are considered, for example: symmetric where π uzv π vzu, non-symmetric where π uzv π vzu. Simulation of the sequence evolution starts from a random binary sequence and 10 7 iterations (mutations) are performed. It is assumed that sequence obtained has (approximately) stationary distribution. 10

Autocorrelation function Definition. Let X(t) be a stationary process and there exists a real number H ( 1 2, 1) and a constant c p > 0 such that autocorrelation function ρ(k) c p k 2H 2, k. Then X(t) is called a stationary process with long-range dependence. The exponent H is called Hurst parameter. For H = 1/2 the observations are uncorrelated (c p = 0), and for H (0, 2 1 ) the process has short-range dependence. Long memory is characterized by a slow decay of the correlations proportional to k 2H 2. A plot of the sequence autocorrelation function should therefore exhibit this decay. 11

Autocorrelation function of simulated binary nucleotide sequence of length n = 200, non-symmetric transition probabilities. 12

Estimating of parameter H, R/S analysis Binary nucleotide sequence x l, l = 1,..., n is subdivided into m non-overlapping blocks. We compute the rescaled adjusted range R(t i, d)/s(t i, d) for a number of values d where t i are the starting points of the blocks. R(t i, d) = max{0, W (t i, 1),..., W (t i, d)} where min{0, W (t i, 1),..., W (t i, d)}, W (t i, k) = k j=1 x ti +j 1 k d k = 1,..., d. d x ti +j 1, j=1 S 2 (t i, d) is a sample variance of x ti,..., x ti +d 1. For each values of m and d we obtain a number of R/S samples and plot log(r/s) vs. log d. 13

A least squares line is fitted to the points of the R/S plot. The slope of the regression line for these R/S samples is an estimate for the Hurst parameter H. R/S plot for simulated binary nucleotide sequence of length n = 200, non-symmetric transition probabilities. The Hurst parameter estimate Ĥ = 0.785 14

Real non-coding sequence Below we present autocorrelation function and R/S plots of bacteria Escherichia coli noncoding sequence of length n 200. Nucleotide recoding rule: {C, G} {1}, {A, T } {0}. Relative frequency of nucleotides C+G is 50.79%. Note, that in the generated sequences P(0) = P(1) = 1/2. 15

The Hurst parameter estimate for this sequence Ĥ = 0.782 16

5. General conclusions 1. The probability model of non-informative nucleotide sequence or, in other words, genetic noise (an analogue of the white noise ) is proposed and its properties are studied mainly by computer simulation 2. The long-range dependence in DNA sequences has been extensively studied and is considered as an evidence of their complexity and hierarchical structure. The work reveals that the long-range dependence is intrinsic property of a the genetic noise generated by a simple evolution model. 17