BIOINFORMATICS. Sequence analysis by additive scales: DNA structure for sequences and repeats of all lengths


BIOINFORMATICS Vol. 16 no. 10 2000, Pages 865-889

Pierre Baldi (1,2) and Pierre-François Baisnée (1)

(1) Department of Information and Computer Science and (2) Department of Biological Chemistry, College of Medicine, University of California, Irvine, CA, USA

Received on April 24, 2000; accepted on May 25, 2000

Abstract

Motivation: DNA structure plays an important role in a variety of biological processes. Different di- and trinucleotide scales have been proposed to capture various aspects of DNA structure including base stacking energy, propeller twist angle, protein deformability, bendability, and position preference. Yet, a general framework for the computational analysis and prediction of DNA structure is still lacking. Such a framework should in particular address the following issues: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic databases; (4) distribution and asymptotic behavior as the length N of the sequences increases; and (5) complete analysis of correlations between scales.

Results: We develop a general framework for sequence analysis based on additive scales, structural or other, that addresses all these issues. We show how to construct extremal sequences and calibrate scores for automatic genomic and database extraction. We show that distributions rapidly converge to normality as N increases. Pairwise correlations between scales depend both on background distribution and sequence length and rapidly converge to an analytically predictable asymptotic value. For di- and tri-nucleotide scales, normal behavior and asymptotic correlation values are attained over a characteristic window length of about 10-15 bp.
With a uniform background distribution, pairwise correlations between empirically-derived scales remain relatively small and roughly constant at all lengths, except for propeller twist and protein deformability which are positively correlated. There is a positive (resp. negative) correlation between dinucleotide base stacking (resp. propeller twist and protein deformability) and AT-content that increases in magnitude with length. The framework is applied to the analysis of various DNA tandem repeats. We derive exact expressions for counting the number of repeat unit classes at all lengths. Tandem repeats are likely to result from a variety of different mechanisms, a fraction of which is likely to depend on profiles characterized by extreme structural features.

Contact: pfbaldi@ics.uci.edu; baisnee@ics.uci.edu

To whom all correspondence should be addressed.

Introduction

Evidence is mounting that DNA structural properties beyond the double helical pattern play an important role in a number of fundamental biological processes, both under healthy and pathological conditions. This is not too surprising if one realizes that meters of DNA must be compacted into a nucleus that is only a few microns in diameter while, at the same time, preserving the ability to turn thousands of genes on and off in a precisely orchestrated fashion. The three-dimensional structure of DNA, as well as its organization into chromatin fibers, seems to be essential to its functions and has been implicated in diverse phenomena ranging from protein binding sites, to gene regulation, to triplet repeat expansion diseases.

The goal of this work is to develop computational methods for the structural analysis of DNA sequences. While DNA structure is our primary motivation and area of application, the framework we develop is completely general and applies to sequences over any alphabet, including codon, RNA, and protein alphabets, whenever local additive scales, as defined below, are available.
DNA structure

DNA structure has been found to depend on the exact sequence of nucleotides, an effect that seems to be caused largely by interactions between neighboring base pairs (Ornstein et al., 1978; Satchwell et al., 1986; Breslauer et al., 1986; Calladine et al., 1988; Goodsell and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan and Calladine, 1996; Hunter, 1996; Ponomarenko et al., 1999; Fye and Benham, 1999). This means that different sequences can have different intrinsic structures, or different propensities for forming particular structures.

© Oxford University Press 2000

Periodic repetitions of bent DNA in phase with the helical pitch, for instance, will cause DNA to assume a macroscopically curved structure. Flexible or intrinsically curved DNA is energetically more favorable to wrap around histones than rigid and unbent DNA, and this has been shown to influence nucleosome positioning (Drew and Travers, 1985; Satchwell et al., 1986; Simpson, 1991; Lu et al., 1994; Wolffe and Drew, 1995; Baldi et al., 1996; Zhu and Thiele, 1996; Liu and Stein, 1997). In addition, the chromatin complex structure of DNA and the positioning of nucleosomes along the genome have been found to play an important (generally inhibitory) role in the regulation of gene transcription (Pazin and Kadonaga, 1997; Tsukiyama and Wu, 1997; Werner and Burley, 1997; Pedersen et al., 1998). Sequence-dependent DNA structure is often important for DNA binding proteins, such as TBP (TATA-binding protein) (Parvin et al., 1995; Starr et al., 1995; Grove et al., 1996), and for gene regulation (Sheridan et al., 1998). While the number of resolved structures of DNA-protein complexes continues to grow in the PDB database, the field of computational DNA structural analysis is clearly far behind its protein cousin and completely lacks any degree of systemicity. Most likely, most DNA structural signals remain to be uncovered.

DNA structural scales

Based on many different empirical measurements or theoretical approaches, several models have been constructed that relate the nucleotide sequence to DNA flexibility and curvature (Ornstein et al., 1978; Satchwell et al., 1986; Goodsell and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan and Calladine, 1996; Hunter, 1996; Baldi et al., 1998; Ponomarenko et al., 1999). These models are typically in the form of dinucleotide or trinucleotide scales that assign a particular value to each di- or tri-nucleotide and its reverse complement.
A non-exhaustive list of such scales includes:

(1) The dinucleotide base stacking energy (BS) scale (Ornstein et al., 1978) expressed in kilocalories per mole. The scale is derived from approximate quantum mechanical calculations on crystal structures.

(2) The dinucleotide propeller twist angle (PT) scale (Hassan and Calladine, 1996) measured in degrees. This scale is based on X-ray crystallography of DNA oligomers. Dinucleotides with a large negative propeller-twist angle tend to be more rigid than dinucleotides with low negative propeller-twist angle.

(3) The dinucleotide protein deformability (PD) scale (Olson et al., 1998) derived from empirical energy functions extracted from the fluctuations and correlations of structural parameters in DNA-protein crystal complexes. Dinucleotides with large PD values tend to be more flexible.

(4) The trinucleotide bendability (B) model (Brukner et al., 1995) based on DNase I cutting frequencies. The enzyme DNase I preferably binds (to the minor groove) and cuts DNA that is bent, or bendable, towards the major groove (Lahm and Suck, 1991; Suck, 1994). Thus DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or anisotropic bendability. These frequencies allow for the derivation of bendability parameters for the 32 complementary trinucleotide pairs. Large B values correspond to flexibility.

(5) The trinucleotide position preference (PP) scale derived from experimental investigations of the positioning of DNA in nucleosomes. It has been found that certain trinucleotides have a strong preference for being positioned in phase with the helical repeat. Depending on the exact rotational position, such triplets will have minor grooves facing either towards or away from the nucleosome core (Satchwell et al., 1986).
Based on the premise that flexible sequences can occupy any rotational position on nucleosomal DNA, these preference values can be used as a triplet scale that measures DNA flexibility. Hence, in this model, all triplets with close to zero preference are assumed to be flexible, while triplets with preference for facing either in or out are taken to be more rigid and have larger PP values. Note that we do not use this scale as a measure of how well different triplets form nucleosomal DNA. Instead, the absolute value, or unsigned nucleosome positioning preference, is used here, as in Pedersen et al. (1998), as a measure of DNA flexibility.

For completeness, all these scales are displayed in the appendix. In previous studies, we found these models useful (Baldi et al., 1996; Pedersen et al., 2000; Baldi et al., 1999), in particular for the detection of putative new structural signatures associated with an increase of bendability in downstream regions of RNA polymerase II promoters. A similar approach (Liao et al., 2000) was used to analyze the structure of insertion sites for P transposable elements in Drosophila melanogaster and suggest that the corresponding transposition mechanism recognizes a structural signature rather than a specific sequence motif. With the exception of BS, all the models were determined by purely experimental observations of sequence-structure correlations. Additional scales capturing DNA properties related directly or indirectly to structure, such as enthalpy, or melting temperature, have also been

proposed (Breslauer et al., 1986; Ponomarenko et al., 1999).

The primary focus of this work is not on assessing the merits and pitfalls of each model, but rather on the development of general methods for the systematic application of any scale to any sequence of any length, up to entire genomes, under the assumption that the scale can be used additively within a sliding window. In general, this assumption will provide a reasonable approximation, at least up to a certain length to be determined experimentally. In particular, we are interested in the development of methods for the automatic recognition of structural motifs associated with extremal features, such as extreme stiffness or bendability. The calibration of corresponding thresholds is expected to be useful in database searches and is conceptually similar, for instance, to the calibration of thresholds for detecting sequence homology. More generally, however, database searches may also be conducted on the basis of structural signatures or profiles that need not be extremal and could be obtained from reasonable training sets. Certain protein binding sites, for instance, are highly degenerate at the DNA sequence level, with low sequence homology, while exhibiting at the same time a high degree of DNA structural similarity. Similarly, periodic flexible triplets in phase with the double helical pitch are necessary to ensure long range curvature, for instance in nucleosome regions.

Although several scales may agree on some structural features, the fact remains that they may also display divergent interpretations of some sequence elements. While no final consensus regarding these models exists, it is likely that each one provides a slightly different and partially complementary view of DNA structure. Thus a second goal of this work is the comparison of the models in the limited sense of estimating the statistical correlation between different scales. In Baldi et al.
(1998) it was shown that by and large many of the commonly used scales exhibit low correlations measured at the level of single di- or tri-nucleotides. Empirical measurements of correlations between the scales over longer lengths in Escherichia coli have recently revealed different unexplained patterns (Pedersen et al., 2000). Here we provide a complete explanation of these phenomena and show how correlations vary with background distribution and with window length.

Finally, while the methods introduced can be applied to any DNA sequence, we focus here on a particularly important class of DNA sequences, namely DNA tandem repeats, where the general framework is further specialized.

DNA repeats

Genomes, especially eukaryotic genomes, are replete with DNA repetitive regions (Jurka et al., 1992; Jeffreys, 1997; Jurka, 1998). Well over 30% of the human genome has been estimated to comprise repetitive DNA of some sort (Benson and Waterman, 1994), the exact function of which is often unknown. Such DNA arises through many different evolutionary and genetic mechanisms. Over 95 different classes (Jurka, 1998) of repeats have been catalogued. Two major groups of repeats exist: interspersed repeats, and tandem repeats. While the methods to be developed can be applied to both groups, our analysis will focus on tandem repeats, consisting of two or more contiguous copies of a particular pattern of nucleotides. Tandem repeats may cover up to 10% of the human genome. Tandem repeats vary widely, over several orders of magnitude, both in terms of the length of the repeating pattern and the number of more or less exact contiguous copies. Repeats are often polymorphic and therefore play a major role in linkage studies and DNA fingerprinting. In many cases, the genetic origin, the structure, and the function of these repetitive regions are poorly understood. There exist a few examples, however, where the repeats are known to play a biological role in both healthy and pathological conditions.
Certain tandem repeats, for instance, have been associated with protein binding sites or interactions with transcription factors. An important advance in epigenetics research has been the realization that interactions between repeated DNA sequences can trigger the formation and the transmission of inactive genetic states and DNA modifications (Wolffe and Matzke, 1999). In several of these cases, the particular DNA-helical structural features of the repeat sequences seem to play an essential role.

Interest in tandem repeats has been heightened over the last few years by the discovery that several important degenerative disorders including Huntington's disease, myotonic dystrophy, fragile X syndrome, and several forms of ataxia, result from the abnormal expansion of particular DNA triplets (The Huntington's Disease Collaborative Research Group, 1993; Ashley and Warren, 1995; Ross, 1995; Gusella and MacDonald, 1996; Hardy and Gwinn-Hardy, 1998; Rubinsztein and Hayden, 1998; Baldi et al., 1999). The exact mechanism by which a triplet repeat mutation causes disease varies, as indicated by the fact that currently known repeat expansions are found in 5' UTRs, in 3' UTRs, in introns, and within coding sequences of various affected genes (Ashley and Warren, 1995; Gusella and MacDonald, 1996; Rubinsztein and Amos, 1998; Rubinsztein and Hayden, 1998). For instance, fragile X mental retardation is associated with an expanded CGG repeat in the 5' UTR of the FMR1 gene (Nelson, 1995; Eichler and Nelson, 1998).

The 64 possible triplets can be clustered into 12 equivalence classes when shift and reverse complement operations are considered (see below). Currently only three repeat classes, CAG, CGG, and GAA, out of the possible twelve, are associated with triplet repeat disorders. There is evidence that unusual structural features of the repeats play a role in their expansion (Wells, 1996;

Pearson and Sinden, 1998a,b; Moore et al., 1999). In Baldi et al. (1999), the structural scales above were used to show that the triplet classes involved in the diseases have extreme structural characteristics of very high or very low flexibility. Methods to quantify the degree of extremality relative to other sequences, however, were not developed. Furthermore, other triplet or non-triplet repeats may play a role in diseases as well as other biological processes. Therefore the techniques need to be improved and extended to all classes of repeats. Hence, given the importance of repeating patterns and the exponential growth of sequence databases, our goal is also to develop new tools for the computational analysis of the structural properties of arbitrary repeats and begin to apply such techniques in a systematic and quantifiable way.

Various algorithms for searching tandem repeats have been developed (Milosavljevic and Jurka, 1993; Benson and Waterman, 1994; Benson, 1999; Blanchard et al., 2000). The techniques presented here can also be viewed as complementing such algorithms by introducing a structural perspective.

Organization

The remainder of the paper is organized as follows. In the next section we develop a general framework for the analysis of the score of a sequence (repetitive or not) under any additive scale. We determine the number of different sequence equivalence classes under circular permutation and reverse complement operations. We show how to determine and visualize maximal and minimal patterns and study the statistical properties of the scales, including intra-scale (mean and variance) and inter-scale (correlations) statistics for sequences of various lengths, as well as asymptotic normality. This framework is essential in order to compare the behavior of various scales, to locate a given sequence with respect to a comparable population, and to automatically set thresholds in database searches.
We then apply the general framework to the five structural models described above and various tandem repeats.

Methods and theory

General framework

The general framework we consider begins with an alphabet $\mathcal{A}$ of size $A$ and a scale $\mathcal{S}$ of length (or size) $S$. The scale is a function that assigns a value to any $S$-tuple of the alphabet, for instance in the form of a table with $A^S$ entries. In the Results section, we deal exclusively with the nucleotide DNA alphabet ($A = 4$) and with DNA structural scales, such as dinucleotide scales with $S = 2$ (e.g. propeller twist) or trinucleotide scales with $S = 3$ (e.g. bendability). The same framework, however, can readily be applied to other situations (e.g. amino acid alphabet with hydrophobicity scales). Given a primary sequence $s = X_1 X_2 \ldots X_N$ of length $N \geq S$ over $\mathcal{A}$, we assume that the scale $\mathcal{S}$ is approximately additive in the sense that the corresponding global property of the sequence $s$ can be estimated by sliding the scale along the sequence in the form

$$\mathcal{S}(s) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots = \sum_{i=1}^{N-S+1} \mathcal{S}(X_i \ldots X_{i+S-1}). \tag{1}$$

In practical applications, such a quantity can also be averaged over a window of length $W$ to get, for instance, a more homogeneous per base-pair value ($W \leq N$). This averaging process does not concern us at this stage since it merely amounts to using a different scale, with a larger size. The form given in equation (1) corresponds to a free boundary condition. The ideas to be developed can be applied to other boundary conditions, including periodic boundary conditions, where the sequence is wrapped around, as described below. With the proper modifications, the theory applies immediately to the case where the scales are shifted by more than one position at each step.

Consider now a repeat sequence $r$ consisting of a unit pattern or period $p = (X_1 \ldots X_P)$ of length $P$, and repetition number $R > 1$, so that $r = (X_1 \ldots X_P)^R$ with $N = PR \geq S$.
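The sliding sum of equation (1) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the table `TOY_DINUC` is a hypothetical scale invented here (the G+C count of each dinucleotide), not one of the published structural scales, and the function name `additive_score` is ours.

```python
from itertools import product

# Hypothetical toy dinucleotide scale (an assumption for illustration only):
# the value of a dinucleotide is its number of G or C letters.
TOY_DINUC = {a + b: (a in "GC") + (b in "GC") for a, b in product("ACGT", repeat=2)}


def additive_score(seq, scale, size, periodic=False):
    """Sum the scale over all windows of length `size`, as in equation (1).

    With periodic=True the sequence is wrapped around, so that one window
    starts at every position (periodic boundary condition).
    """
    n = len(seq)
    if n < size:
        raise ValueError("sequence must be at least as long as the scale")
    if periodic:
        extended = seq + seq[: size - 1]  # wrap the first size-1 letters
        starts = range(n)
    else:
        extended = seq
        starts = range(n - size + 1)      # free boundary: N - S + 1 windows
    return sum(scale[extended[i : i + size]] for i in starts)


if __name__ == "__main__":
    print(additive_score("CAGCAG", TOY_DINUC, 2))              # free boundary
    print(additive_score("CAG", TOY_DINUC, 2, periodic=True))  # wrapped unit
```

Any real scale can be substituted for the toy table as long as it maps every $S$-tuple to a number.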
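Notice that the period is not uniquely determined since, for instance, XXXX can be viewed as $(X)^4$, or as $(XX)^2$. In addition, we will assume that $P + S - 1 \leq N$, or equivalently that $S \leq (R-1)P + 1$, so that the scale $\mathcal{S}$ is applied starting at least once from each letter in the repetitive unit, without exceeding the repetitive sequence boundary. In this case, $\mathcal{S}(r)$ has the form

$$\mathcal{S}(r) = l\,\mathcal{S}(p) + \epsilon \tag{2}$$

where $\mathcal{S}(p)$ is the contribution of the periodic unit

$$\mathcal{S}(p) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots + \mathcal{S}(X_P X_1 \ldots X_{S-1}) = \sum_{i=1}^{P} \mathcal{S}(X_i \ldots X_{i+S-1}) \quad [\bmod\ P]. \tag{3}$$

The number $l$ of times the periodic unit is covered by $\mathcal{S}$ and its shifted versions is given by

$$l = \left\lfloor \frac{PR - S + 1}{P} \right\rfloor = R - \left\lceil \frac{S-1}{P} \right\rceil. \tag{4}$$

Finally, if $lP + S - 1 = RP$ then the boundary tail $\epsilon$ is equal to 0. Otherwise

$$\epsilon = \mathcal{S}(X_{lP+1} \ldots X_{lP+S}) + \cdots + \mathcal{S}(X_{RP-S+1} \ldots X_{RP}) \tag{5}$$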

where indices can be taken modulo $P$, i.e. $X_{lP+1} = X_1$ and so forth. The sum in equation (5) has at most $P - 1$ terms. In practice, at least in the case of DNA, only short scales are currently available and therefore in most cases $S \leq P + 1$. In this case, equation (2) simplifies to

$$\mathcal{S}(r) = (R-1)\,\mathcal{S}(p) + \epsilon. \tag{6}$$

Equivalence classes

In the special case of repetitive sequences, we also need to be able to count the number of different repeats with respect to a given scale. It is often the case that the scale $\mathcal{S}$ is characterized by some kind of invariance with respect to the sequences of length $S$ over $\mathcal{A}$. In the case of DNA, the structural scales we have are invariant with respect to the reverse complement. When looking at repeat sequences, this determines how many different repeat patterns of length $P$ need to be considered. A triplet repeat, for instance, can be described in terms of different unit trinucleotides depending on what strand and triplet frame is chosen. Thus, the repeat CAGCAGCAG... can be said to be a repeat of the triplet CAG, and also of its reverse complement CTG. Ignoring repeat boundaries, however, the sequence can also be described as a repeat of the shifted triplet pairs AGC/GCT and GCA/TGC. In this way, the 64 different trinucleotides can be divided into 12 possible repeat classes. Of these 12 classes, only 10 are proper triplet repeat classes in the sense that they do not result from a repeat pattern of shorter length. The two classes associated with shorter patterns are obviously the triplet pairs AAA/TTT and CCC/GGG, which are more precisely described as mononucleotide repeats. [For a generic alphabet $\mathcal{A}$, a reverse complement operation can be defined by introducing a one-to-one function $X \to \bar{X}$ from the alphabet to itself, satisfying $\bar{\bar{X}} = X$, so that the reverse complement of $X_1 \ldots X_N$ is defined to be $\bar{X}_N \ldots \bar{X}_1$.]
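The CAG example above can be reproduced with a short sketch that lists every unit equivalent to a given one under circular shifts and reverse complement. The function names are hypothetical and introduced only for illustration.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(unit):
    """Reverse the unit and complement each letter."""
    return "".join(COMPLEMENT[x] for x in reversed(unit))


def repeat_class(unit):
    """All units equivalent to `unit` under circular shifts and reverse complement."""
    members = set()
    for strand in (unit, reverse_complement(unit)):
        for shift in range(len(strand)):
            members.add(strand[shift:] + strand[:shift])
    return sorted(members)


if __name__ == "__main__":
    print(repeat_class("CAG"))  # the six triplets named in the text
    print(repeat_class("AAA"))  # a mononucleotide repeat written as a triplet
```

Using a set makes sub-periodic or self-complementary units collapse automatically, so the size of the returned list is the size of the class.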
In the case of a DNA repeat with unit repeat length $P$, the number of classes and the number of elements in each equivalence class is dictated by the action of the group of transformations associated with the circular permutations and the reverse complement operations on the set of all possible strings of length $P$. AAA.../TTT... and CCC.../GGG... always give rise to two separate classes with two elements each. In general, a typical class will contain $2P$ elements associated with the $P$ permutations and the $P$ reverse complements. Classes containing fewer elements, however, can arise for instance as a result of sub-periodicity effects when $P$ is not prime, and of identical reverse complement effects. For instance, when $P = 4$, the class of ATAT contains only two elements since it is identical to its reverse complement and can be shifted circularly only once before returning to the original pattern.

The number of classes can be counted using standard group theory arguments detailed in the appendix. These arguments are not restricted to circular permutation and reverse complement operations, but apply to any group of transformations over any sequences. The number of classes, when only circular permutations without reverse complement are taken into account, is given by

$$\frac{1}{P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} = \frac{1}{P} \sum_{1 \leq k \leq P} A^{(P,k)} \tag{7}$$

where $(P,k)$ is the greatest common divisor (gcd) of $P$ and $k$, and $\phi(n)$ is the Euler function counting the number of positive integers not exceeding $n$ which are prime to $n$, i.e. without common divisors with $n$. If $p_1, \ldots, p_k$ is the list of distinct prime factors of $n$, then the Euler function can be expressed as

$$\phi(n) = n \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right). \tag{8}$$

When both circular permutations and reverse complement are taken into account, the number of classes for odd $P$ is given by

$$\frac{1}{2P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} = \frac{1}{2P} \sum_{1 \leq k \leq P} A^{(P,k)}. \tag{9}$$

When $P$ is even, the corresponding number of classes is

$$\frac{1}{2P} \left[ \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} + \frac{P}{2} A^{P/2} \right] \tag{10}$$

or, equivalently,

$$\frac{1}{2P} \left[ \sum_{1 \leq k \leq P} A^{(P,k)} + \frac{P}{2} A^{P/2} \right].$$
(11)

In particular, when $P$ is an odd prime, the number of different classes under periodic and reverse complement equivalence is

$$\frac{1}{2P} \left[ (P-1)A + A^{P} \right]. \tag{12}$$

The number of classes which are new at a given length $P$, i.e. that do not result from the repetition of a shorter pattern of length dividing $P$, can easily be obtained by subtracting the counts of new classes for each proper divisor of $P$. When $P$ is prime, all classes are new except for the classes resulting from mono-letter repeats. Table 1 in the Results section exemplifies the application of equations (9)-(12).
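As a hedged sketch, the counting formulas (7)-(12) can be evaluated directly. The function names below are ours, and the reverse-complement count assumes, as for DNA, that no letter is its own complement.

```python
from math import gcd


def euler_phi(n):
    """Euler function via the product over distinct prime factors, equation (8)."""
    result, m, p = n, n, 2
    while p * p <= m:
        if m % p == 0:
            while m % p == 0:
                m //= p
            result -= result // p
        p += 1
    if m > 1:
        result -= result // m
    return result


def classes_circular(P, A=4):
    """Right-hand side of equation (7): circular permutations only."""
    return sum(A ** gcd(P, k) for k in range(1, P + 1)) // P


def classes_circular_divisor_form(P, A=4):
    """Left-hand side of equation (7): sum over the divisors d of P."""
    return sum(euler_phi(P // d) * A ** d for d in range(1, P + 1) if P % d == 0) // P


def classes_with_reverse_complement(P, A=4):
    """Equations (9) and (11): circular permutations and reverse complement."""
    total = sum(A ** gcd(P, k) for k in range(1, P + 1))
    if P % 2 == 0:
        total += (P // 2) * A ** (P // 2)  # extra term for even P
    return total // (2 * P)


if __name__ == "__main__":
    for P in range(1, 7):
        print(P, classes_circular(P), classes_with_reverse_complement(P))
```

For $P = 3$ and $A = 4$ the last function returns the 12 triplet classes discussed above.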

Extremal sequences and automata

We are interested in the construction and recognition of sequences $s$ that are extremal for $\mathcal{S}$, i.e. such that $\mathcal{S}(s)$ is very large or very small relative to the other sequences of length $N$. For this, we attach to each scale a prefix automaton, or prefix graph. The prefix automaton can be described by a directed graph containing $A^{S-1}$ nodes, each labeled by a string of length $S-1$ over $\mathcal{A}$ of the form $X_1 \ldots X_{S-1}$ (see Figure 1 for an example). Each node has $A$ directed outgoing connections: $X_1 \ldots X_{S-1}$ is connected to $X_2 \ldots X_{S-1} Y$, for each letter $Y$ in $\mathcal{A}$, hence the notion of prefix. The weight (or length) of the corresponding transition is provided by the entry associated with $X_1 X_2 \ldots X_{S-1} Y$ in the structural table. The $A$ nodes labeled $(X)^{S-1} = XXX \ldots X$ (monorepeats) are the only ones to have a self-connection. Any sequence $s$ of length $N$ is trivially associated with the path $X_1 \ldots X_{S-1} \to X_2 \ldots X_S \to \cdots \to X_{N-S+2} \ldots X_N$. The value of $\mathcal{S}(s)$ is found just by adding the weights of the corresponding connections. As a result, sequences associated with maximal or minimal values of $\mathcal{S}(s)$ correspond to paths in the prefix graph with maximal or minimal total weight or length. These can easily be found by standard dynamic programming techniques, which can also be extended to finding, for instance, the $k$ longest or shortest paths.

A repeat pattern of length $P$ is a directed cycle in the prefix automaton graph. Notice that any path of length greater than $A^{S-1}$ must intersect itself at least once. Thus any cycle of length strictly greater than $A^{S-1}$ must be composed of non-intersecting cycles of length at most $A^{S-1}$. For instance, with a dinucleotide scale, any repeat unit of length greater than four must contain at least two cycles of length at most four. Therefore in the study of repeats, we need only to study the properties of all non-intersecting directed cycles of length up to $A^{S-1}$ together with all possible ways of joining them.
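The dynamic programming search over the prefix graph can be sketched as follows. This is one straightforward way to do it, not necessarily the authors' implementation: each state is a node of the prefix graph (the last $S-1$ letters), and the table `TOY_DINUC` is again a hypothetical scale used only to make the example runnable.

```python
from itertools import product

# Hypothetical toy scale (assumption): G+C count of each dinucleotide.
TOY_DINUC = {a + b: (a in "GC") + (b in "GC") for a, b in product("ACGT", repeat=2)}


def extremal_sequence(scale, size, length, alphabet="ACGT", maximize=True):
    """Return (score, sequence) of an extremal sequence of the given length.

    A state is a node of the prefix graph, i.e. the last size-1 letters;
    appending a letter Y follows the edge weighted scale[state + Y].
    Assumes size >= 2 and length >= size.
    """
    pick = max if maximize else min
    # Every node is a possible starting point, with score 0 so far.
    best = {"".join(t): (0, "".join(t)) for t in product(alphabet, repeat=size - 1)}
    for _ in range(length - size + 1):
        step = {}
        for state, (score, seq) in best.items():
            for letter in alphabet:
                target = (state + letter)[1:]
                candidate = (score + scale[state + letter], seq + letter)
                step[target] = pick(step[target], candidate) if target in step else candidate
        best = step
    return pick(best.values())


if __name__ == "__main__":
    print(extremal_sequence(TOY_DINUC, 2, 8))
    print(extremal_sequence(TOY_DINUC, 2, 8, maximize=False))
```

The running time grows like $N A^S$ rather than $A^N$; ties between equally good paths are broken by string order, which is an arbitrary choice of this sketch.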
In addition to dynamic programming techniques, it is also useful to tabulate the weights of all possible short cycles for at least two reasons. First, because longer patterns are built from shorter cycles. Second, at least in the case of DNA, many important existing repeats, such as triplet repeats, are based on a short repeating pattern.

While the prefix graph is useful for constructing extremal sequences and recognizing them as long as $A$, $S$ and $N$ are small, it is also necessary to develop more general techniques by which we can rapidly assess, for any sequence $s$, the magnitude of $\mathcal{S}(s)$ with respect to all the other comparable sequences. This is best achieved by viewing the sequences in a probabilistic context.

Probabilistic modeling

Consider now that sequences are being generated by a random process. In order to fix the ideas, we take for simplicity a Markov model of order 0, i.e. we assume that sequences are generated by $N$ tosses of the same die with distribution $D = (p_X)$ over the alphabet $\mathcal{A}$. The same analysis, however, can easily be extended to other probabilistic models such as higher-order Markov models where distributions are defined, for instance, on pairs or triplets of letters. From equation (1), $\mathcal{S}(s)$ is now a random variable which is the sum of $N - S + 1$ random variables: $\mathcal{S}(s) = Y_1 + \cdots + Y_{N-S+1}$. By construction, all the variables $Y_i = \mathcal{S}(X_i \ldots X_{i+S-1})$ have the same distribution, but they are not independent. Rather they satisfy a form of local dependence, called m-dependence in statistics. More precisely, for $i < j$, $Y_i$ and $Y_j$ are independent whenever $j - i \geq S$, since the corresponding windows then share no letter. Using the linearity of the expectation, we have

$$E(\mathcal{S}(s)) = (N - S + 1)\,E(Y_i) \approx N \alpha_S \tag{13}$$

with

$$E(Y_i) = \sum_{X_1 \ldots X_S} \mathcal{S}(X_1 \ldots X_S)\, p(X_1) \cdots p(X_S) = \alpha_S \tag{14}$$

the sum being over all $A^S$ $S$-tuples of the alphabet. To situate an individual sequence with respect to the entire population, we need to calculate the variance.
The variance can also be calculated explicitly by taking advantage of the local dependence of the variables $Y_i$. We have

$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,\mathrm{Var}(Y_i) + 2 \sum_{0 < j - i < S} \mathrm{Cov}(Y_i, Y_j) \tag{15}$$

with the covariances $\mathrm{Cov}(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))]$. As soon as $j - i \geq S$, $Y_i$ and $Y_j$ are independent and the corresponding covariance is 0. Thus, for any given scale $\mathcal{S}$, one needs only to tabulate the expectation $E(Y_i)$ and the $S$ relevant short-range covariances

$$C_k = C_k(\mathcal{S}) = \mathrm{Cov}(Y_i, Y_{i+k}) \tag{16}$$

for $0 \leq k < S$ ($C_0 = \mathrm{Var}(Y_i)$). Alternatively, by factoring out the variance of $Y_i$, equation (15) can also be expressed in terms of the correlations

$$\mathrm{Var}(\mathcal{S}(s)) = \mathrm{Var}(Y_i) \left[ N - S + 1 + 2 \sum_{0 < j - i < S} \mathrm{Cor}(Y_i, Y_j) \right]. \tag{17}$$

To obtain the exact variance at each length $N$, it is then only a matter of counting how many times each type of covariance is present in the sequences and adjusting for any boundary effects as needed.
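The counting argument above can be sketched in Python under the order-0 model: tabulate $E(Y_i)$ and the covariances $C_k$ once, then weight each $C_k$ by the number of window pairs at lag $k$. Exact rational arithmetic (`fractions.Fraction`) is our choice for the sketch, and the function names are hypothetical; the scale is any dictionary from $S$-tuples to numbers.

```python
from fractions import Fraction
from itertools import product


def window_moments(scale, size, probs):
    """Return E(Y_i) and the covariances [C_0, ..., C_{S-1}] (order-0 model)."""
    letters = sorted(probs)

    def prob(word):
        p = Fraction(1)
        for x in word:
            p *= probs[x]
        return p

    mean = sum(prob(w) * scale["".join(w)] for w in product(letters, repeat=size))
    covs = []
    for k in range(size):
        # Y_i and Y_{i+k} together span size + k consecutive letters.
        joint = sum(
            prob(w) * scale["".join(w[:size])] * scale["".join(w[k:k + size])]
            for w in product(letters, repeat=size + k)
        )
        covs.append(joint - mean * mean)
    return mean, covs


def exact_moments(scale, size, probs, length):
    """Exact mean and variance of S(s) over sequences of the given length."""
    mean, covs = window_moments(scale, size, probs)
    m = length - size + 1                        # number of windows
    variance = m * covs[0]
    for k in range(1, size):
        variance += 2 * max(m - k, 0) * covs[k]  # window pairs at lag k
    return m * mean, variance
```

Passing `probs = {x: Fraction(1, 4) for x in "ACGT"}` gives the uniform background; a non-uniform background only changes the dictionary.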

If $N \geq 2S - 1$, then

$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2 \sum_{k=1}^{S-1} (N - S - k + 1)\,C_k. \tag{18}$$

If $S \leq N < 2S - 1$, then

$$\mathrm{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2 \sum_{k=1}^{N-S} (N - S - k + 1)\,C_k. \tag{19}$$

It is worth noticing that, for fixed $S$, both the expectation and the variance are linear in $N$. In particular, for large $N$

$$\mathrm{Var}(\mathcal{S}(s)) \approx N \left[ C_0 + 2 \sum_{k=1}^{S-1} C_k \right] = N \left[ \sum_{k=-S+1}^{S-1} C_k \right] = N \beta_S. \tag{20}$$

In the last equality, for obvious symmetry reasons, we let $C_{-k} = C_k$. This notation will prove to be useful below.

In the case of repetitive sequences, it is also useful to calculate the expectation of $\mathcal{S}(p) = Y_1 + \cdots + Y_P$, and its variance with periodic boundary conditions modulo $P$, i.e. assuming the variables $Y_1 \ldots Y_P$ and the corresponding letters are arranged along a circle. Here both the expectation and the variance are directly proportional to $P$ and satisfy $E(\mathcal{S}(p)) = \alpha_S P$ and $\mathrm{Var}(\mathcal{S}(p)) = \beta_S P$. Clearly, for any $P$, $E(\mathcal{S}(p)) = P\,E(Y_i)$ so

$$\alpha_S = E(Y_i). \tag{21}$$

If $P \geq 2S - 1$,

$$\beta_S = C_0 + 2 \sum_{k=1}^{S-1} C_k = \sum_{k=-S+1}^{S-1} C_k. \tag{22}$$

When $S \leq P < 2S - 1$, all variables along the circle are dependent and therefore $\beta_S = \beta_S(P)$ is given by

$$\beta_S(P) = C_0 + 2 \sum_{k=1}^{n} C_k = \sum_{k=-n}^{n} C_k \tag{23}$$

when $P = 2n + 1$, and

$$\beta_S(P) = C_0 + C_n + 2 \sum_{k=1}^{n-1} C_k = C_n + \sum_{k=-n+1}^{n-1} C_k \tag{24}$$

when $P = 2n$. Periodic boundary conditions must be used in the computation of the covariances $C_k$ whenever necessary ($|k| > P - S$). For a periodic sequence $r$, where the period $P$ as well as $S$ are small relative to the length $N = RP$, we can use

$$E(\mathcal{S}(r)) \approx R\,E(\mathcal{S}(p)) \tag{25}$$

and the approximation

$$\mathrm{Var}(\mathcal{S}(r)) \approx R\,\mathrm{Var}(\mathcal{S}(p)). \tag{26}$$

For long repetitive sequences with period $P < S$, we can use the same approach with a larger period $P'$, multiple of $P$, so that $S \leq P'$.

Central limit theorem. $\mathcal{S}(s)$ consists of a sum of identically distributed but non-independent random variables. Therefore standard central limit theorems for sums of independent random variables cannot be applied. Yet, because the dependencies are local, a sum $Z = Y_1 + \cdots + Y_K$ of $K$ m-dependent random variables $Y_i$ still approaches a normal distribution.
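The identities $E(\mathcal{S}(p)) = \alpha_S P$ and $\mathrm{Var}(\mathcal{S}(p)) = \beta_S P$ for $P \geq 2S - 1$ can be checked numerically with a small sketch. It computes $\alpha_S$ and $\beta_S$ from equations (21) and (22) under the order-0 model and compares them with a brute-force enumeration of all repeat units arranged on a circle. The function names and the use of exact fractions are assumptions of this sketch.

```python
from fractions import Fraction
from itertools import product


def word_prob(word, probs):
    """Probability of a word under the order-0 model."""
    p = Fraction(1)
    for x in word:
        p *= probs[x]
    return p


def alpha_beta(scale, size, probs):
    """alpha_S = E(Y_i) and beta_S = C_0 + 2 * (C_1 + ... + C_{S-1})."""
    letters = sorted(probs)
    alpha = sum(word_prob(w, probs) * scale["".join(w)]
                for w in product(letters, repeat=size))
    beta = Fraction(0)
    for k in range(size):
        joint = sum(
            word_prob(w, probs) * scale["".join(w[:size])] * scale["".join(w[k:k + size])]
            for w in product(letters, repeat=size + k)
        )
        cov = joint - alpha * alpha
        beta += cov if k == 0 else 2 * cov
    return alpha, beta


def periodic_moments(scale, size, probs, period):
    """Mean and variance of S(p), enumerating every unit of length `period`."""
    letters = sorted(probs)
    first = second = Fraction(0)
    for unit in product(letters, repeat=period):
        wrapped = unit + unit[: size - 1]         # periodic boundary condition
        score = sum(scale["".join(wrapped[i:i + size])] for i in range(period))
        weight = word_prob(unit, probs)
        first += weight * score
        second += weight * score * score
    return first, second - first * first
```

For a dinucleotide scale and $P = 4$, for example, `periodic_moments(scale, 2, probs, 4)` should equal `(4 * alpha, 4 * beta)` with `alpha, beta = alpha_beta(scale, 2, probs)`; the enumeration costs $A^P$ and is only meant for short units.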
This can be shown using the theorem in Baldi and Rinott (1989) which provides also a bound on the rate of convergence. Here we use the improved bound found in Rinott and Dembo (1996). We let $\max_i |Y_i - E(Y_i)| = B$, and $E\left( \sum_{i=1}^{K} |Y_i - E(Y_i)| \right) / K = \mu$. For all the scales to be considered, these constants are well defined and easy to compute. Under these assumptions,

$$\left| P\!\left( \frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \leq u \right) - \Phi(u) \right| \leq 7\, \frac{K \mu}{[\mathrm{Var}(Z)]^{3/2}}\, (2S - 1)^2 B^2 \tag{27}$$

where $\Phi(u)$ is the normalized Gaussian distribution. The factor $(2S - 1)$ represents the size of the clusters associated with m-dependence. For a fixed scale, such size is constant but the theorem remains true if $S$ grows slowly with $N$. Thus equation (27) can readily be applied to $\mathcal{S}(s)$ or $\mathcal{S}(r)$ with $K = N - S + 1$ or $K = RP$. From equation (20), the variance of the sequences being considered is linear in their length: $\mathrm{Var}(\mathcal{S}(s)) \approx \beta N$, where $\beta$ depends only on the scale $\mathcal{S}$. Thus we obtain a convergence rate that scales at most like $1/\sqrt{N}$:

$$\left| P\!\left( \frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \leq u \right) - \Phi(u) \right| \leq \frac{C}{\sqrt{N}} \tag{28}$$

with $C \approx 7 \mu (2S - 1)^2 B^2 \beta^{-3/2}$. The rate of this bound is known to be essentially optimal (similar to the Berry-Esseen theorems, Feller, 1971).

Normalized distances and extremal sequences. The value of $\mathcal{S}(s)$ or $\mathcal{S}(r)$ of any sequence or repeat of length $N$ can be compared to the average value of a background population by computing a normalized Z-score of the form

$$Z(s) = \frac{\mathcal{S}(s) - \alpha_S N}{\sqrt{\beta_S N}}. \tag{29}$$

A repeat $r$ with period unit length $P$ and repetition $R$ ($N = RP$) can be compared to a background population of repeats, or a background population of

8 P.Baldi and P.-F.Baisnée generic sequences. In the latter case, we have S(r) S(p)R or S(r) α N. Therefore the Z-score Z(r) = (α α) N β (3) grows with N and is larger than the Z-score Z(p) computed on the repeat unit. In other words, if a repeat unit displays extremal features when compared to other repeat units of the same length, its expansion will appear even more extreme compared to the background of all sequences of similar length. The Z-scores can be used to assess how extreme a sequence is and to search databases for subsequences with extremal features. As in the case of alignments, this can also be done using extreme value distributions (Durbin et al., 998). Note also that one can search a database using a structural profile rather than extreme values. The degree of similarity between two profiles can be measured, for instance, using the standard mean square error. Correlations between scales. It is useful to have some information regarding the degree of correlation between two scales and how such correlation behaves at all sequence lengths. Consider then two scales S and S 2 of length S and S 2. Without any loss of generality assume that S S 2. For sequences s of length N, we are interested in measuring the correlation between the random variables S (s) = Y + +Y N S +,withy i = S (X i...x i+s ), and S 2 (s) = Z + + Z N S2 +, with Z i = S 2 (X i...x i+s2 ).Wehave: Cor (S (s), S 2 (s)) = Cov(S (s), S 2 (s)) Var(S (s)) Var(S 2 (s)). (3) Again only terms of the form Cov(Y i, Z j ), where the distance between i and j is small, are non-zero. More precisely, non-zero terms can arise only if j i S or i j S 2. It is sufficient to tabulate the finite set of S + S 2 covariances Cov(Y i, Z i+k ) C k = C k (S, S 2 ) = E[(Y i E(Y i ))(Z i+k E(Z i+k ))] (32) with S S 2 and S 2 + k S. These covariances can be used to compute correlations at all lengths by writing Cov(S (s), S 2 (s)) = (N S 2 + )Cov(Y i, Z i ) +2 i = j Cov(Y i, Z j ). 
(33) For large N it is clear that, except for small boundary effects, each type of covariance occurs approximately N times in the formula above. Therefore for large N, Cov(S (s), S 2 (s)) behaves approximately as [ S ] [ S N C + C k = N C k ]. (34) k= S 2 +,k = k= S 2 + We have seen in equations (2) that the variance of each scale is also asymptotically linear in the length N. Thus, as N increases, the correlation Cor(S (s), S 2 (s)) rapidly converges to a constant given by: S k= S 2 + C k(s S 2 ) [( S )( k= S + C S2 )] /2. (35) k(s ) k= S 2 + C k(s 2 ) In checking calculations on DNA scales (or other alphabets) that are invariant under the reverse complement operation, it is worth noticing that with a uniform distribution on the alphabet (p A = p C = p G = p T =.25), the correlations are symmetric. That is, for any < k < S we have C k (S, S 2 ) = C k (S, S 2 ). This results immediately from the fact that the sum of the terms S 2 (X...X S2 ) S (X...X S ) and S 2 ( X S2... X ) S ( X S2... X S2 S +) is equal to the sum of the terms S 2 (X...X S2 ) S (X S2 S +...X S2 ) and S 2 ( X S2... X ) S ( X S... X ), and similarly for other degrees of overlaps. The terms in the sums can be identically paired using the fact that S and S 2 are assumed to be reverse-complement invariant. The result is not true if the scales, or the distribution, are not reverse-complement invariant. Results DNA repeat equivalence classes We wrote a program that cycles through all possible DNA sequences of length P counting and listing all the classes that are equivalent under circular permutation and reverse complement operations. Because of this equivalence, in the case of scales that are reverse-complement invariant, it is sufficient to study the repeats of one representative member of each class. We ran the program up to length P = 2. The results, shown in Table, are in complete agreement with equations (9) (2). 
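The enumeration of repeat-unit equivalence classes can be sketched as follows. This is a minimal illustration, not the authors' program: the function names are hypothetical, and the approach (brute-force enumeration of all $4^P$ units, grouping each with its rotations and their reverse complements) is only practical for short periods.

```python
from itertools import product

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Reverse complement of a DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def equivalence_class(unit):
    """All units equivalent to `unit` under rotation and reverse complement."""
    rotations = {unit[i:] + unit[:i] for i in range(len(unit))}
    return frozenset(rotations | {reverse_complement(r) for r in rotations})


def repeat_classes(period):
    """Classes for a given period, each listed alphabetically, in order of first member."""
    classes = []
    seen = set()
    for letters in product("ACGT", repeat=period):  # alphabetical enumeration
        unit = "".join(letters)
        if unit not in seen:
            members = equivalence_class(unit)
            seen |= members
            classes.append(sorted(members))
    return classes
```

For example, `repeat_classes(3)` yields the trinucleotide classes in alphabetical order of their first member, so the class containing CAG appears under its representative AGC.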
In Tables 2, 3 and 4 we list alphabetically all the members of each equivalence class for sequences of length 2 to 4. When P = 4, for instance, one finds 39 classes: 27 classes with 8 elements, 8 classes with 4 elements, and 4 classes with 2 elements (27 × 8 + 8 × 4 + 4 × 2 = 256 = 4^4). Only 33 classes are new, in the sense that 6 classes are derived from patterns already encountered at P = 1 and 2. Likewise, when P > 2 is a prime number, the total number of classes is given by:

$$\frac{4^P - 4}{2P} + 2 \tag{36}$$

with two classes of size 2 associated with poly-A and poly-C, while all the remaining classes are new and contain 2P members. In the appendix, we provide tables in alphabetical order that allow one to invert Tables 3 and 4, i.e. to find the class associated with any given P-tuple (P = 3, 4).

Table 1. Number of repeat unit equivalence classes. New or proper classes are classes that do not contain a shorter periodic pattern

Sequence length    Classes (total)    Classes (new)

Table 2. Dinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally. Classes 1 and 5 are not proper dinucleotide classes

Class number    List of members (alphabetical order)
1               AA TT
2               AC CA GT TG
3               AG CT GA TC
4               AT TA
5               CC GG
6               CG GC

Table 3. Trinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally

Class number    List of members (alphabetical order)
1               AAA TTT
2               AAC ACA CAA GTT TGT TTG
3               AAG AGA CTT GAA TCT TTC
4               AAT ATA ATT TAA TAT TTA
5               ACC CAC CCA GGT GTG TGG
6               ACG CGA CGT GAC GTC TCG
7               ACT AGT CTA GTA TAC TAG
8               AGC CAG CTG GCA GCT TGC
9               AGG CCT CTC GAG GGA TCC
10              ATC ATG CAT GAT TCA TGA
11              CCC GGG
12              CCG CGC CGG GCC GCG GGC

Fig. 1. Dinucleotide prefix automaton for the propeller twist angle scale. The CAG repeat, for instance, is associated with the cycle C → A → G → C in the graph and has a total propeller twist value of −34.53. The corresponding reverse complement cycle is given by C → T → G → C. The triplet repeat class with the largest propeller twist value is CCC followed by CCG.

Analysis of DNA repeats by dinucleotide scales

In the case of dinucleotide scales, the prefix automaton contains four nodes (Figure 1).
Each DNA sequence is associated with a path through the corresponding graph, and exact repeats are associated with cycles. All paths, including cycles, of length greater than four are composite in the sense that they contain a cycle of length 4 or less. In Table 5, we list the dinucleotide scale values $S(X_1X_2) + S(X_2X_1)$ for the six equivalence classes associated with all 16 possible dinucleotide repeats of the form $(X_1X_2)^R$. For each scale, we list classes (represented by their first alphabetical member) and the corresponding scale value, in decreasing value order. The highest level of base stacking energy is achieved by the AT repeat class (−10.39) and the lowest by the CG repeat class (−24.28). The rankings of all possible dinucleotide repeats induced by the propeller twist and the protein deformability scales are identical with the exception of an inversion between the CC (−16.22 and 12.2) and CG (−21.11 and 16.1) classes at the high (flexible) end of the spectrum. At the opposite (stiff) end, we find the single letter repeat class AA (−37.32 and 5.8) followed by the proper dinucleotide repeat class AG (−27.48 and 6.6).

Table 4. Tetranucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally

Class number    List of members (alphabetical order)
1               AAAA TTTT
2               AAAC AACA ACAA CAAA GTTT TGTT TTGT TTTG
3               AAAG AAGA AGAA CTTT GAAA TCTT TTCT TTTC
4               AAAT AATA ATAA ATTT TAAA TATT TTAT TTTA
5               AACC ACCA CAAC CCAA GGTT GTTG TGGT TTGG
6               AACG ACGA CGAA CGTT GAAC GTTC TCGT TTCG
7               AACT ACTA AGTT CTAA GTTA TAAC TAGT TTAG
8               AAGC AGCA CAAG CTTG GCAA GCTT TGCT TTGC
9               AAGG AGGA CCTT CTTC GAAG GGAA TCCT TTCC
10              AAGT ACTT AGTA CTTA GTAA TAAG TACT TTAC
11              AATC ATCA ATTG CAAT GATT TCAA TGAT TTGA
12              AATG ATGA ATTC CATT GAAT TCAT TGAA TTCA
13              AATT ATTA TAAT TTAA
14              ACAC CACA GTGT TGTG
15              ACAG AGAC CAGA CTGT GACA GTCT TCTG TGTC
16              ACAT ATAC ATGT CATA GTAT TACA TATG TGTA
17              ACCC CACC CCAC CCCA GGGT GGTG GTGG TGGG
18              ACCG CCGA CGAC CGGT GACC GGTC GTCG TCGG
19              ACCT AGGT CCTA CTAC GGTA GTAG TACC TAGG
20              ACGC CACG CGCA CGTG GCAC GCGT GTGC TGCG
21              ACGG CCGT CGGA CGTC GACG GGAC GTCC TCCG
22              ACGT CGTA GTAC TACG
23              ACTC AGTG CACT CTCA GAGT GTGA TCAC TGAG
24              ACTG AGTC CAGT CTGA GACT GTCA TCAG TGAC
25              AGAG CTCT GAGA TCTC
26              AGAT ATAG ATCT CTAT GATA TAGA TATC TCTA
27              AGCC CAGC CCAG CTGG GCCA GCTG GGCT TGGC
28              AGCG CGAG CGCT CTCG GAGC GCGA GCTC TCGC
29              AGCT CTAG GCTA TAGC
30              AGGC CAGG CCTG CTGC GCAG GCCT GGCA TGCC
31              AGGG CCCT CCTC CTCC GAGG GGAG GGGA TCCC
32              ATAT TATA
33              ATCC ATGG CATC CCAT GATG GGAT TCCA TGGA
34              ATCG CGAT GATC TCGA
35              ATGC CATG GCAT TGCA
36              CCCC GGGG
37              CCCG CCGC CGCC CGGG GCCC GCGG GGCG GGGC
38              CCGG CGGC GCCG GGCC
39              CGCG GCGC

In Table 6, we list the dinucleotide scale values for the 12 equivalence classes associated with all possible triplet repeats of the form $(XYZ)^R$.
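The cyclic sums behind these repeat-unit values can be sketched in a few lines. The example below is illustrative only: the function names are hypothetical, and the propeller twist values per dinucleotide step are an assumption (quoted as commonly tabulated for this scale, since the scale itself is not listed in this section); they should be checked against the original scale before any real use.

```python
# Propeller twist angles (degrees) per dinucleotide step.
# ASSUMPTION: values quoted as commonly tabulated for this scale; they are not
# given in this section and should be verified against the original source.
PROPELLER_TWIST = {
    "AA": -18.66, "AC": -13.10, "AG": -14.00, "AT": -15.01,
    "CA": -9.45, "CC": -8.11, "CG": -10.03, "CT": -14.00,
    "GA": -13.48, "GC": -11.08, "GG": -8.11, "GT": -13.10,
    "TA": -11.85, "TC": -13.48, "TG": -9.45, "TT": -18.66,
}


def repeat_unit_score(unit, scale, width=2):
    """S(p): sum of the scale over all cyclic windows of the repeat unit."""
    period = len(unit)
    total = 0.0
    for start in range(period):
        # Indices wrap modulo the period (periodic boundary conditions).
        word = "".join(unit[(start + offset) % period] for offset in range(width))
        total += scale[word]
    return total


def rank_units(units, scale, width=2):
    """Repeat units sorted by decreasing cyclic scale value."""
    return sorted(units, key=lambda u: repeat_unit_score(u, scale, width), reverse=True)
```

Under these assumed values, `repeat_unit_score("CAG", PROPELLER_TWIST)` and `repeat_unit_score("CTG", PROPELLER_TWIST)` coincide, which illustrates why one representative per equivalence class suffices for a reverse-complement invariant scale.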
In this special case, we find the results of Baldi et al. (1999). The high and low ends of the base stacking energy scale are occupied by the triplet classes AAT (−15.76) and CCG (−32.54) respectively. We find again a high degree of correlation between the propeller twist and protein deformability scales. If we exclude the classes AAA/TTT (−55.98) and CCC/GGG (−24.33), which are not proper triplet repeat classes, then the maximum and the minimum of the propeller twist spectrum are respectively occupied by the classes CCG (−29.22) and AAG (−46.14). A similar ranking with the same extremal triplets is observed with the protein deformability scale: CCG (22.2) occupies the high end, whereas AAA (8.7) and AAG (9.5) occupy the low end of the spectrum. When considering all three dinucleotide scales, three minima and two maxima are occupied by two of the three repeat classes known to be involved in triplet repeat expansion diseases, namely AAG and CCG. GAA triplet (in the AAG class) expansion is associated with Friedreich's ataxia (Orr et al., 1993; Campuzano et al., 1996; Junck and Fink, 1996; Paulson et al., 1997; Koenig, 1998; Lee, 1998; Orr and Zoghbi, 1998; Paulson, 1998; Pulst, 1998; Stevanin et al., 1998). Abnormal GCC triplet (in the CCG class) expansion is associated with FRAXE mental retardation and abnormal expansion of the CGG triplet with fragile X syndrome (FRAXA) (Nelson, 1995; Gusella and MacDonald, 1996; Eichler and Nelson, 1998; Skinner et al., 1998; Gecz and Mulley, 1999). The third triplet expansion disease related class, AGC, has average rank in all dinucleotide scales.

Table 5. Dinucleotide structural scale values for repeat unit $p = X_1X_2$ with P = 2. $S(p) = S(X_1X_2) + S(X_2X_1)$

Base stacking    Propeller twist    Protein deformability
AT (−10.39)      CC (−16.22)        CG (16.1)
AA (−10.74)      CG (−21.11)        CC (12.2)
CC (−16.52)      AC (−22.55)        AC (12.1)
AG (−16.59)      AT (−26.86)        AT (7.9)
AC (−17.08)      AG (−27.48)        AG (6.6)
CG (−24.28)      AA (−37.32)        AA (5.8)

Table 6. Dinucleotide structural scale values for repeat unit $p = X_1X_2X_3$ with P = 3. $S(p) = S(X_1X_2) + S(X_2X_3) + S(X_3X_1)$. Repeat classes associated with triplet repeat expansion diseases are marked with an asterisk

Base stacking    Propeller twist    Protein deformability
AAT (−15.76)     CCC (−24.33)       CCG* (22.2)
AAA (−16.11)     CCG* (−29.22)      ACG (18.9)
ACT (−21.11)     ACC (−30.66)       CCC (18.3)
AAG* (−21.96)    AGC* (−34.53)      ACC (18.2)
AAC (−22.45)     AGG (−35.59)       AGC* (15.9)
ATC (−22.95)     ACG (−36.61)       ATC (15.9)
CCC (−24.78)     ATC (−37.94)       AAC (15.0)
AGG (−24.85)     ACT (−38.95)       AGG (12.7)
ACC (−25.34)     AAC (−41.21)       AAT (10.8)
AGC* (−27.94)    AAT (−45.52)       ACT (10.7)
ACG (−30.01)     AAG* (−46.14)      AAG* (9.5)
CCG* (−32.54)    AAA (−55.98)       AAA (8.7)

In Table 7, we list the scale values for the 39 equivalence classes associated with all possible tetranucleotide repeats of the form $(X_1X_2X_3X_4)^R$. The maximum of the base stacking scale is occupied by the dinucleotide repeat ATAT (−20.78) and the proper tetranucleotide repeat AAAT (−21.13). The minimum corresponds to CGCG (−48.56) followed by ACGC (−41.36).
We again observe a substantial positive correlation between the values produced by the propeller twist and protein deformability scales together with a weaker negative correlation with respect to the base stacking energy scale. The high end of the propeller twist scale is occupied by CCCC (−32.44) and CCCG (−37.33) while that of the protein deformability scale is occupied by CGCG (32.2) and CCCG (28.3). The lowest values correspond for both scales to AAAA (−74.64 and 11.6) and AAAG (−64.80 and 12.4).

All repeat units of length greater than 4 are made up of shorter cyclic paths in the prefix automaton and therefore their properties can essentially be predicted from the previous three tables. For all lengths, for instance, the highest level of base stacking energy is achieved by the class ATATATAT... when P is even, and by the class AATATATAT... when P is odd. The lowest level is achieved by the class CGCGCG... when P is even, and CCGCGCG... when P is odd. For protein deformability, the maximal level is achieved by the class CGCGCG... when P is even, and by CCGCGCG... when P is odd. The lowest level is associated with poly-A (i.e. $(A)^P$). Poly-C and poly-A also give the absolute highest and lowest propeller twist angles at all lengths.

Table 7. Dinucleotide structural scale values for repeat unit $p = X_1X_2X_3X_4$ with P = 4. $S(p) = S(X_1X_2) + S(X_2X_3) + S(X_3X_4) + S(X_4X_1)$

Base stacking    Propeller twist    Protein deformability
ATAT (−20.78)    CCCC (−32.44)      CGCG (32.2)
AAAT (−21.13)    CCCG (−37.33)      CCCG (28.3)
AATT (−21.13)    CCGG (−37.33)      CCGG (28.3)
AAAA (−21.48)    ACCC (−38.77)      ACGC (28.2)
AACT (−26.48)    CGCG (−42.22)      ATGC (25.2)
AAGT (−26.48)    AGCC (−42.64)      ACCG (25.0)
AGAT (−26.98)    AGGC (−42.64)      ACGG (25.0)
AAAG (−27.33)    ACGC (−43.66)      CCCC (24.4)
ACAT (−27.47)    AGGG (−43.70)      ACCC (24.3)
AAAC (−27.82)    ACCG (−44.72)      ACAC (24.2)
AATC (−28.32)    ACGG (−44.72)      ACGT (23.0)
AATG (−28.32)    ATGC (−44.99)      AGCG (22.7)
ACCT (−29.37)    ACAC (−45.10)      ATCG (22.7)
AAGG (−30.22)    ATCC (−46.05)      AGCC (22.0)
AACC (−30.71)    ACCT (−47.06)      AGGC (22.0)
ATCC (−31.21)    ACGT (−48.08)      ATCC (22.0)
AGCT (−31.97)    AGCG (−48.59)      AACG (21.8)
CCCC (−33.04)    AACC (−49.32)      AACC (21.1)
AGGG (−33.11)    ACAT (−49.41)      ACAT (20.0)
AGAG (−33.18)    ACAG (−50.03)      AAGC (18.8)
AAGC (−33.31)    ACTC (−50.03)      AATC (18.8)
ACCC (−33.60)    ACTG (−50.03)      AATG (18.8)
ACAG (−33.67)    AGCT (−50.93)      AGGG (18.8)
ACTC (−33.67)    ATCG (−52.00)      ACAG (18.7)
ACTG (−33.67)    AAGC (−53.19)      ACTC (18.7)
ACAC (−34.16)    ATAT (−53.72)      ACTG (18.7)
ATGC (−34.30)    AAGG (−54.25)      AAAC (17.9)
ACGT (−34.53)    AGAT (−54.34)      ACCT (16.8)
AACG (−35.38)    AGAG (−54.96)      ATAT (15.8)
ATCG (−35.88)    AACG (−55.27)      AAGG (15.6)
AGCC (−36.20)    AATC (−56.60)      AGAT (14.5)
AGGC (−36.20)    AATG (−56.60)      AGCT (14.5)
ACCG (−38.27)    AACT (−57.61)      AAAT (13.7)
ACGG (−38.27)    AAGT (−57.61)      AATT (13.7)
CCCG (−40.80)    AAAC (−59.87)      AACT (13.6)
CCGG (−40.80)    AAAT (−64.18)      AAGT (13.6)
AGCG (−40.87)    AATT (−64.18)      AGAG (13.2)
ACGC (−41.36)    AAAG (−64.80)      AAAG (12.4)
CGCG (−48.56)    AAAA (−74.64)      AAAA (11.6)

Analysis of DNA repeats by trinucleotide scales

In the case of trinucleotide scales, the prefix automaton contains 16 nodes (Figure 2), each one labeled with a different dinucleotide. All paths, including cycles, of length greater than 16 are composite, i.e. contain at least one cycle of length 16 or less.

Fig. 2. Trinucleotide prefix automaton for the bendability scale. Circle is used for ease of display but does not represent actual connections. The CAG repeat, for instance, is associated with the cycle CA → AG → GC → CA in the graph and has a total bendability value of 0.175 + 0.017 + 0.076 = 0.268. It is the highest bendability value for any triplet repeat. Other edges are not shown.

Table 8. Trinucleotide structural scale values for repeat unit $p = X_1X_2$ with P = 2. $S(p) = S(X_1X_2X_1) + S(X_2X_1X_2)$

Bendability      Position preference
AT (0.364)       AA (72)
AG (0.058)       CG (50)
AC (0.034)       AT (26)
CC (−0.024)      CC (26)
CG (−0.154)      AC (23)
AA (−0.548)      AG (17)

The trinucleotide scale values for all repeats with periodic unit length P = 2 are given in Table 8. The highest level of bendability is achieved by AT (0.364) and the lowest by AA (−0.548) and CG (−0.154). The highest level of position preference is achieved by AA (72) and CG (50), and the lowest by AG (17). The trinucleotide scale values for all repeats with periodic unit length P = 3 are given in Table 9 (see also Baldi et al., 1999). The highest level of bendability is achieved by the class AGC (0.268) and the lowest by AAA (−0.822) and ACC (−0.238).
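The correspondence between a repeat unit and its cycle in the prefix automaton can be made concrete with a short sketch. The function names below are hypothetical; the only assumption is the construction described in the text, where nodes are words one letter shorter than the scale and each edge is a word of the full scale length.

```python
def cyclic_windows(unit, width):
    """All windows of the given width read around the repeat unit (indices wrap)."""
    period = len(unit)
    return ["".join(unit[(start + offset) % period] for offset in range(width))
            for start in range(period)]


def automaton_cycle(unit, scale_width=3):
    """Nodes and edges visited by the repeat (unit)^R in the prefix automaton.

    Nodes are words of length scale_width - 1; edges are words of length
    scale_width, i.e. the words on which the scale is evaluated.
    """
    nodes = cyclic_windows(unit, scale_width - 1)
    edges = cyclic_windows(unit, scale_width)
    return nodes + nodes[:1], edges  # repeat the first node to close the cycle
```

For the CAG repeat and a trinucleotide scale, this returns the node sequence CA, AG, GC, CA together with the three edges CAG, AGC and GCA, whose scale values are summed to obtain the value of the repeat unit. With `scale_width=2` the same function gives the single-letter cycle used for dinucleotide scales.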
In fact only two classes of repeats (AGC and ATC) have positive bendability and are well separated from the rest. The highest level of position preference is achieved by the class AAA (108) followed by CCG (72), and the lowest by AGG and ACC (21). The class AGC, which contains the CAG repeat responsible for the majority of the known triplet repeat expansion diseases, has the highest bendability. It is the only repeat class for which all three shifted triplets have a high individual bendability. Moreover, this class has a relatively low position preference value, another sign of flexibility. Therefore one can hypothesize that long CAG repeats correspond to stretches of DNA that are highly flexible in all positions. Consistent with their high flexibility, CAG/CTG repeats have been found to have the highest affinity for histones among all possible triplet repeats (Wang and Griffith, 1994, 1995; Godde and Wolffe, 1996). Other DNA sequences can adopt long range curvature only if they contain highly flexible triplets in phase with the helical pitch (roughly every 10.5 bp). The flexibility of extended CAG repeats has been verified experimentally (Chastain and Sinden, 1998). The CCG class, which contains the disease-related triplets CGG and GCC, is found at the high (rigid) end of the position preference scale (72), exceeded only by poly-A. This class is also stiff according to the bendability scale (−0.106). This is consistent with the fact that CGG/CCG repeats seem completely unable to form nucleosomes (Wang et al., 1996; Godde et al., 1996). The AAG class, which contains the disease-related triplet GAA, occupies the lower (flexible) end of the position preference scale (27). It is the second lowest considering that the last two classes have the same value (21). We also note that AAA/TTT is by far the stiffest of all possible repeats according to both scales. Such homopolymeric tracts are known from X-ray crystallography to be rigid and straight (Nelson et al., 1987) and they are bad candidates for nucleosome positioning. In fact, a number of promoters in yeast contain homopolymeric dA:dT elements. Studies in two different yeast species have shown that the homopolymeric elements destabilize nucleosomes and thereby facilitate


More information

part 3: analysis of natural selection pressure

part 3: analysis of natural selection pressure part 3: analysis of natural selection pressure markov models are good phenomenological codon models do have many benefits: o principled framework for statistical inference o avoiding ad hoc corrections

More information

Bio nformatics. Lecture 16. Saad Mneimneh

Bio nformatics. Lecture 16. Saad Mneimneh Bio nformatics Lecture 16 DNA sequencing To sequence a DNA is to obtain the string of bases that it contains. It is impossible to sequence the whole DNA molecule directly. We may however obtain a piece

More information

Using algebraic geometry for phylogenetic reconstruction

Using algebraic geometry for phylogenetic reconstruction Using algebraic geometry for phylogenetic reconstruction Marta Casanellas i Rius (joint work with Jesús Fernández-Sánchez) Departament de Matemàtica Aplicada I Universitat Politècnica de Catalunya IMA

More information

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton.

part 4: phenomenological load and biological inference. phenomenological load review types of models. Gαβ = 8π Tαβ. Newton. 2017-07-29 part 4: and biological inference review types of models phenomenological Newton F= Gm1m2 r2 mechanistic Einstein Gαβ = 8π Tαβ 1 molecular evolution is process and pattern process pattern MutSel

More information

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies

3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies. 3. Evolution makes sense of homologies Richard Owen (1848) introduced the term Homology to refer to structural similarities among organisms. To Owen, these similarities indicated that organisms were created following a common plan or archetype.

More information

The role of the FliD C-terminal domain in pentamer formation and

The role of the FliD C-terminal domain in pentamer formation and The role of the FliD C-terminal domain in pentamer formation and interaction with FliT Hee Jung Kim 1,2,*, Woongjae Yoo 3,*, Kyeong Sik Jin 4, Sangryeol Ryu 3,5 & Hyung Ho Lee 1, 1 Department of Chemistry,

More information

evoglow - express N kit distributed by Cat.#: FP product information broad host range vectors - gram negative bacteria

evoglow - express N kit distributed by Cat.#: FP product information broad host range vectors - gram negative bacteria evoglow - express N kit broad host range vectors - gram negative bacteria product information distributed by Cat.#: FP-21020 Content: Product Overview... 3 evoglow express N -kit... 3 The evoglow -Fluorescent

More information

GEOMETRY OF THE KIMURA 3-PARAMETER MODEL arxiv:math/ v1 [math.ag] 27 Feb 2007

GEOMETRY OF THE KIMURA 3-PARAMETER MODEL arxiv:math/ v1 [math.ag] 27 Feb 2007 GEOMETRY OF THE KIMURA 3-PARAMETER MODEL arxiv:math/0702834v1 [math.ag] 27 Feb 2007 MARTA CASANELLAS AND JESÚS FERNÁNDEZ-SÁNCHEZ Abstract. TheKimura3-parametermodelonatreeofnleavesisoneofthemost used in

More information

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective

Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective Jacobs University Bremen Encoding of Amino Acids and Proteins from a Communications and Information Theoretic Perspective Semester Project II By: Dawit Nigatu Supervisor: Prof. Dr. Werner Henkel Transmission

More information

evoglow - express N kit Cat. No.: product information broad host range vectors - gram negative bacteria

evoglow - express N kit Cat. No.: product information broad host range vectors - gram negative bacteria evoglow - express N kit broad host range vectors - gram negative bacteria product information Cat. No.: 2.1.020 evocatal GmbH 2 Content: Product Overview... 4 evoglow express N kit... 4 The evoglow Fluorescent

More information

Evolutionary dynamics of abundant stop codon readthrough in Anopheles and Drosophila

Evolutionary dynamics of abundant stop codon readthrough in Anopheles and Drosophila biorxiv preprint first posted online May. 3, 2016; doi: http://dx.doi.org/10.1101/051557. The copyright holder for this preprint (which was not peer-reviewed) is the author/funder. All rights reserved.

More information

The Trigram and other Fundamental Philosophies

The Trigram and other Fundamental Philosophies The Trigram and other Fundamental Philosophies by Weimin Kwauk July 2012 The following offers a minimal introduction to the trigram and other Chinese fundamental philosophies. A trigram consists of three

More information

Nature Genetics: doi:0.1038/ng.2768

Nature Genetics: doi:0.1038/ng.2768 Supplementary Figure 1: Graphic representation of the duplicated region at Xq28 in each one of the 31 samples as revealed by acgh. Duplications are represented in red and triplications in blue. Top: Genomic

More information

Evolutionary Analysis of Viral Genomes

Evolutionary Analysis of Viral Genomes University of Oxford, Department of Zoology Evolutionary Biology Group Department of Zoology University of Oxford South Parks Road Oxford OX1 3PS, U.K. Fax: +44 1865 271249 Evolutionary Analysis of Viral

More information

Pattern Matching (Exact Matching) Overview

Pattern Matching (Exact Matching) Overview CSI/BINF 5330 Pattern Matching (Exact Matching) Young-Rae Cho Associate Professor Department of Computer Science Baylor University Overview Pattern Matching Exhaustive Search DFA Algorithm KMP Algorithm

More information

Seeking Significant Oligomers via Set Partitions Expected Count. Stephen Sauchi Lee*

Seeking Significant Oligomers via Set Partitions Expected Count. Stephen Sauchi Lee* International Journal of Computational Science 1992-6669 (Print) 1992-6677 (Online) www.gip.hk/ijcs 2008 Global Information Publisher (H.K) Co., Ltd. All rights reserved. Seeking Significant Oligomers

More information

Introduction to Molecular Phylogeny

Introduction to Molecular Phylogeny Introduction to Molecular Phylogeny Starting point: a set of homologous, aligned DNA or protein sequences Result of the process: a tree describing evolutionary relationships between studied sequences =

More information

Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi

Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi Supporting Information http://www.genetics.org/cgi/content/full/genetics.110.116228/dc1 Sex-Linked Inheritance in Macaque Monkeys: Implications for Effective Population Size and Dispersal to Sulawesi Ben

More information

Species Tree Inference using SVDquartets

Species Tree Inference using SVDquartets Species Tree Inference using SVDquartets Laura Kubatko and Dave Swofford May 19, 2015 Laura Kubatko SVDquartets May 19, 2015 1 / 11 SVDquartets In this tutorial, we ll discuss several different data types:

More information

Timing molecular motion and production with a synthetic transcriptional clock

Timing molecular motion and production with a synthetic transcriptional clock Timing molecular motion and production with a synthetic transcriptional clock Elisa Franco,1, Eike Friedrichs 2, Jongmin Kim 3, Ralf Jungmann 2, Richard Murray 1, Erik Winfree 3,4,5, and Friedrich C. Simmel

More information

Lecture 15. Saad Mneimneh

Lecture 15. Saad Mneimneh Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string

More information

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F, Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string

More information

From DNA to protein, i.e. the central dogma

From DNA to protein, i.e. the central dogma From DNA to protein, i.e. the central dogma DNA RNA Protein Biochemistry, chapters1 5 and Chapters 29 31. Chapters 2 5 and 29 31 will be covered more in detail in other lectures. ph, chapter 1, will be

More information

Re- engineering cellular physiology by rewiring high- level global regulatory genes

Re- engineering cellular physiology by rewiring high- level global regulatory genes Re- engineering cellular physiology by rewiring high- level global regulatory genes Stephen Fitzgerald 1,2,, Shane C Dillon 1, Tzu- Chiao Chao 2, Heather L Wiencko 3, Karsten Hokamp 3, Andrew DS Cameron

More information

The Physical Language of Molecules

The Physical Language of Molecules The Physical Language of Molecules How do molecular codes emerge and evolve? International Workshop on Bio Soft Matter Tokyo, 2008 Biological information is carried by molecules Self replicating information

More information

Dihedral Reductions of Cyclic DNA Sequences

Dihedral Reductions of Cyclic DNA Sequences Symmetry 2015, 7, 67-88; doi:10.3390/sym7010067 OPEN ACCESS symmetry ISSN 2073-8994 www.mdpi.com/journal/symmetry Article Dihedral Reductions of Cyclic DNA Sequences Marlos A.G. Viana Symmetry Studies

More information

AtTIL-P91V. AtTIL-P92V. AtTIL-P95V. AtTIL-P98V YFP-HPR

AtTIL-P91V. AtTIL-P92V. AtTIL-P95V. AtTIL-P98V YFP-HPR Online Resource 1. Primers used to generate constructs AtTIL-P91V, AtTIL-P92V, AtTIL-P95V and AtTIL-P98V and YFP(HPR) using overlapping PCR. pentr/d- TOPO-AtTIL was used as template to generate the constructs

More information

Lecture IV A. Shannon s theory of noisy channels and molecular codes

Lecture IV A. Shannon s theory of noisy channels and molecular codes Lecture IV A Shannon s theory of noisy channels and molecular codes Noisy molecular codes: Rate-Distortion theory S Mapping M Channel/Code = mapping between two molecular spaces. Two functionals determine

More information

ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line

ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line PRODUCT DATASHEET ChemiScreen CaS Calcium Sensor Receptor Stable Cell Line CATALOG NUMBER: HTS137C CONTENTS: 2 vials of mycoplasma-free cells, 1 ml per vial. STORAGE: Vials are to be stored in liquid N

More information

codon substitution models and the analysis of natural selection pressure

codon substitution models and the analysis of natural selection pressure 2015-07-20 codon substitution models and the analysis of natural selection pressure Joseph P. Bielawski Department of Biology Department of Mathematics & Statistics Dalhousie University introduction morphological

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Slide 1 / 54. Gene Expression in Eukaryotic cells

Slide 1 / 54. Gene Expression in Eukaryotic cells Slide 1 / 54 Gene Expression in Eukaryotic cells Slide 2 / 54 Central Dogma DNA is the the genetic material of the eukaryotic cell. Watson & Crick worked out the structure of DNA as a double helix. According

More information

Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar

Motif Finding Algorithms. Sudarsan Padhy IIIT Bhubaneswar Motif Finding Algorithms Sudarsan Padhy IIIT Bhubaneswar Outline Gene Regulation Regulatory Motifs The Motif Finding Problem Brute Force Motif Finding Consensus and Pattern Branching: Greedy Motif Search

More information

THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE

THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE STATISTICA, anno LXIX, n. 2 3, 2009 THE MATHEMATICAL STRUCTURE OF THE GENETIC CODE: A TOOL FOR INQUIRING ON THE ORIGIN OF LIFE Diego Luis Gonzalez CNR-IMM, Bologna Section, Via Gobetti 101, I-40129, Bologna,

More information

Aoife McLysaght Dept. of Genetics Trinity College Dublin

Aoife McLysaght Dept. of Genetics Trinity College Dublin Aoife McLysaght Dept. of Genetics Trinity College Dublin Evolution of genome arrangement Evolution of genome content. Evolution of genome arrangement Gene order changes Inversions, translocations Evolution

More information

Near-instant surface-selective fluorogenic protein quantification using sulfonated

Near-instant surface-selective fluorogenic protein quantification using sulfonated Electronic Supplementary Material (ESI) for rganic & Biomolecular Chemistry. This journal is The Royal Society of Chemistry 2014 Supplemental nline Materials for ear-instant surface-selective fluorogenic

More information

Supplementary Information

Supplementary Information Supplementary Information Arginine-rhamnosylation as new strategy to activate translation elongation factor P Jürgen Lassak 1,2,*, Eva Keilhauer 3, Max Fürst 1,2, Kristin Wuichet 4, Julia Gödeke 5, Agata

More information

Pathways and Controls of N 2 O Production in Nitritation Anammox Biomass

Pathways and Controls of N 2 O Production in Nitritation Anammox Biomass Supporting Information for Pathways and Controls of N 2 O Production in Nitritation Anammox Biomass Chun Ma, Marlene Mark Jensen, Barth F. Smets, Bo Thamdrup, Department of Biology, University of Southern

More information

DNA Structure. Voet & Voet: Chapter 29 Pages Slide 1

DNA Structure. Voet & Voet: Chapter 29 Pages Slide 1 DNA Structure Voet & Voet: Chapter 29 Pages 1107-1122 Slide 1 Review The four DNA bases and their atom names The four common -D-ribose conformations All B-DNA ribose adopt the C2' endo conformation All

More information

Symmetry Studies. Marlos A. G. Viana

Symmetry Studies. Marlos A. G. Viana Symmetry Studies Marlos A. G. Viana aaa aac aag aat caa cac cag cat aca acc acg act cca ccc ccg cct aga agc agg agt cga cgc cgg cgt ata atc atg att cta ctc ctg ctt gaa gac gag gat taa tac tag tat gca gcc

More information

Interpolated Markov Models for Gene Finding. BMI/CS 776 Spring 2015 Colin Dewey

Interpolated Markov Models for Gene Finding. BMI/CS 776  Spring 2015 Colin Dewey Interpolated Markov Models for Gene Finding BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2015 Colin Dewey cdewey@biostat.wisc.edu Goals for Lecture the key concepts to understand are the following the

More information

Prof. Christian MICHEL

Prof. Christian MICHEL CIRCULAR CODES IN GENES AND GENOMES - 2013 - Prof. Christian MICHEL Theoretical Bioinformatics ICube University of Strasbourg, CNRS France c.michel@unistra.fr http://dpt-info.u-strasbg.fr/~c.michel/ Prof.

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Chain-like assembly of gold nanoparticles on artificial DNA templates via Click Chemistry

Chain-like assembly of gold nanoparticles on artificial DNA templates via Click Chemistry Electronic Supporting Information: Chain-like assembly of gold nanoparticles on artificial DNA templates via Click Chemistry Monika Fischler, Alla Sologubenko, Joachim Mayer, Guido Clever, Glenn Burley,

More information

Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation.

Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation. Supplementary Figure 1. Schematic of split-merger microfluidic device used to add transposase to template drops for fragmentation. Inlets are labelled in blue, outlets are labelled in red, and static channels

More information

Biology 644: Bioinformatics

Biology 644: Bioinformatics A stochastic (probabilistic) model that assumes the Markov property Markov property is satisfied when the conditional probability distribution of future states of the process (conditional on both past

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Supporting Information. An Electric Single-Molecule Hybridisation Detector for short DNA Fragments

Supporting Information. An Electric Single-Molecule Hybridisation Detector for short DNA Fragments Supporting Information An Electric Single-Molecule Hybridisation Detector for short DNA Fragments A.Y.Y. Loh, 1 C.H. Burgess, 2 D.A. Tanase, 1 G. Ferrari, 3 M.A. Maclachlan, 2 A.E.G. Cass, 1 T. Albrecht*

More information

It is the author's version of the article accepted for publication in the journal "Biosystems" on 03/10/2015.

It is the author's version of the article accepted for publication in the journal Biosystems on 03/10/2015. It is the author's version of the article accepted for publication in the journal "Biosystems" on 03/10/2015. The system-resonance approach in modeling genetic structures Sergey V. Petoukhov Institute

More information

ydci GTC TGT TTG AAC GCG GGC GAC TGG GCG CGC AAT TAA CGG TGT GTA GGC TGG AGC TGC TTC

ydci GTC TGT TTG AAC GCG GGC GAC TGG GCG CGC AAT TAA CGG TGT GTA GGC TGG AGC TGC TTC Table S1. DNA primers used in this study. Name ydci P1ydcIkd3 Sequence GTC TGT TTG AAC GCG GGC GAC TGG GCG CGC AAT TAA CGG TGT GTA GGC TGG AGC TGC TTC Kd3ydcIp2 lacz fusion YdcIendP1 YdcItrgP2 GAC AGC

More information

Stochastic processes and Markov chains (part II)

Stochastic processes and Markov chains (part II) Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Amsterdam, The

More information

Codon-model based inference of selection pressure. (a very brief review prior to the PAML lab)

Codon-model based inference of selection pressure. (a very brief review prior to the PAML lab) Codon-model based inference of selection pressure (a very brief review prior to the PAML lab) an index of selection pressure rate ratio mode example dn/ds < 1 purifying (negative) selection histones dn/ds

More information

Title: Robust analysis of synthetic label-free DNA junctions in solution by X-ray scattering and molecular simulation

Title: Robust analysis of synthetic label-free DNA junctions in solution by X-ray scattering and molecular simulation Supplementary Information Title: Robust analysis of synthetic label-free DNA junctions in solution by X-ray scattering and molecular simulation Kyuhyun Im 1,5, Daun Jeong 2,5, Jaehyun Hur 1, Sung-Jin Kim

More information

Insects act as vectors for a number of important diseases of

Insects act as vectors for a number of important diseases of pubs.acs.org/synthbio Novel Synthetic Medea Selfish Genetic Elements Drive Population Replacement in Drosophila; a Theoretical Exploration of Medea- Dependent Population Suppression Omar S. Abari,,# Chun-Hong

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the

Biosynthesis of Bacterial Glycogen: Primary Structure of Salmonella typhimurium ADPglucose Synthetase as Deduced from the JOURNAL OF BACTERIOLOGY, Sept. 1987, p. 4355-4360 0021-9193/87/094355-06$02.00/0 Copyright X) 1987, American Society for Microbiology Vol. 169, No. 9 Biosynthesis of Bacterial Glycogen: Primary Structure

More information

Casting Polymer Nets To Optimize Molecular Codes

Casting Polymer Nets To Optimize Molecular Codes Casting Polymer Nets To Optimize Molecular Codes The physical language of molecules Mathematical Biology Forum (PRL 2007, PNAS 2008, Phys Bio 2008, J Theo Bio 2007, E J Lin Alg 2007) Biological information

More information