BIOINFORMATICS. Sequence analysis by additive scales: DNA structure for sequences and repeats of all lengths


BIOINFORMATICS Vol. 16 no. 10 2000

Pierre Baldi 1,2,* and Pierre-François Baisnée 1

1 Department of Information and Computer Science and 2 Department of Biological Chemistry, College of Medicine, University of California, Irvine, CA, USA

Received on April 24, 2000; accepted on May 25, 2000

Abstract

Motivation: DNA structure plays an important role in a variety of biological processes. Different di- and trinucleotide scales have been proposed to capture various aspects of DNA structure including base stacking energy, propeller twist angle, protein deformability, bendability, and position preference. Yet, a general framework for the computational analysis and prediction of DNA structure is still lacking. Such a framework should in particular address the following issues: (1) construction of sequences with extremal properties; (2) quantitative evaluation of sequences with respect to a given genomic background; (3) automatic extraction of extremal sequences and profiles from genomic databases; (4) distribution and asymptotic behavior as the length N of the sequences increases; and (5) complete analysis of correlations between scales.

Results: We develop a general framework for sequence analysis based on additive scales, structural or other, that addresses all these issues. We show how to construct extremal sequences and calibrate scores for automatic genomic and database extraction. We show that distributions rapidly converge to normality as N increases. Pairwise correlations between scales depend both on background distribution and sequence length and rapidly converge to an analytically predictable asymptotic value. For di- and tri-nucleotide scales, normal behavior and asymptotic correlation values are attained over a characteristic window length of about 10-15 bp.
With a uniform background distribution, pairwise correlations between empirically-derived scales remain relatively small and roughly constant at all lengths, except for propeller twist and protein deformability, which are positively correlated. There is a positive (resp. negative) correlation between dinucleotide base stacking (resp. propeller twist and protein deformability) and AT-content that increases in magnitude with length. The framework is applied to the analysis of various DNA tandem repeats. We derive exact expressions for counting the number of repeat unit classes at all lengths. Tandem repeats are likely to result from a variety of different mechanisms, a fraction of which is likely to depend on profiles characterized by extreme structural features.

Contact: pfbaldi@ics.uci.edu; baisnee@ics.uci.edu

* To whom all correspondence should be addressed.

Introduction

Evidence is mounting that DNA structural properties beyond the double helical pattern play an important role in a number of fundamental biological processes, both under healthy and pathological conditions. This is not too surprising if one realizes that meters of DNA must be compacted into a nucleus that is only a few microns in diameter while, at the same time, preserving the ability to turn thousands of genes on and off in a precisely orchestrated fashion. The three-dimensional structure of DNA, as well as its organization into chromatin fibers, seems to be essential to its functions and has been implicated in diverse phenomena ranging from protein binding sites, to gene regulation, to triplet repeat expansion diseases.

The goal of this work is to develop computational methods for the structural analysis of DNA sequences. While DNA structure is our primary motivation and area of application, the framework we develop is completely general and applies to sequences over any alphabet, including codon, RNA, and protein alphabets, whenever local additive scales, as defined below, are available.
DNA structure

DNA structure has been found to depend on the exact sequence of nucleotides, an effect that seems to be caused largely by interactions between neighboring base pairs (Ornstein et al., 1978; Satchwell et al., 1986; Breslauer et al., 1986; Calladine et al., 1988; Goodsell and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan and Calladine, 1996; Hunter, 1996; Ponomarenko et al., 1999; Fye and Benham, 1999). This means that different sequences can have different intrinsic structures, or different propensities for forming particular structures.

© Oxford University Press 2000

Periodic repetitions of bent DNA in phase with the helical pitch, for instance, will cause DNA to assume a macroscopically curved structure. Flexible or intrinsically curved DNA is energetically more favorable to wrap around histones than rigid and unbent DNA, and this has been shown to influence nucleosome positioning (Drew and Travers, 1985; Satchwell et al., 1986; Simpson, 1991; Lu et al., 1994; Wolffe and Drew, 1995; Baldi et al., 1996; Zhu and Thiele, 1996; Liu and Stein, 1997). In addition, the chromatin complex structure of DNA and the positioning of nucleosomes along the genome have been found to play an important (generally inhibitory) role in the regulation of gene transcription (Pazin and Kadonaga, 1997; Tsukiyama and Wu, 1997; Werner and Burley, 1997; Pedersen et al., 1998). Sequence-dependent DNA structure is often important for DNA binding proteins, such as TBP (TATA-binding protein) (Parvin et al., 1995; Starr et al., 1995; Grove et al., 1996), and for gene regulation (Sheridan et al., 1998). While the number of resolved structures of DNA-protein complexes continues to grow in the PDB database, the field of computational DNA structural analysis is clearly far behind its protein cousin and completely lacks any degree of systemicity. Most likely, most DNA structural signals remain to be uncovered.

DNA structural scales

Based on many different empirical measurements or theoretical approaches, several models have been constructed that relate the nucleotide sequence to DNA flexibility and curvature (Ornstein et al., 1978; Satchwell et al., 1986; Goodsell and Dickerson, 1994; Sinden, 1994; Brukner et al., 1995; Hassan and Calladine, 1996; Hunter, 1996; Baldi et al., 1998; Ponomarenko et al., 1999). These models are typically in the form of dinucleotide or trinucleotide scales that assign a particular value to each di- or tri-nucleotide and its reverse complement.
A non-exhaustive list of such scales includes:

(1) The dinucleotide base stacking energy (BS) scale (Ornstein et al., 1978), expressed in kilocalories per mole. The scale is derived from approximate quantum mechanical calculations on crystal structures.

(2) The dinucleotide propeller twist angle (PT) scale (Hassan and Calladine, 1996), measured in degrees. This scale is based on X-ray crystallography of DNA oligomers. Dinucleotides with a large negative propeller-twist angle tend to be more rigid than dinucleotides with low negative propeller-twist angle.

(3) The dinucleotide protein deformability (PD) scale (Olson et al., 1998), derived from empirical energy functions extracted from the fluctuations and correlations of structural parameters in DNA-protein crystal complexes. Dinucleotides with large PD values tend to be more flexible.

(4) The trinucleotide bendability (B) model (Brukner et al., 1995), based on DNase I cutting frequencies. The enzyme DNase I preferably binds (to the minor groove) and cuts DNA that is bent, or bendable, towards the major groove (Lahm and Suck, 1991; Suck, 1994). Thus DNase I cutting frequencies on naked DNA can be interpreted as a quantitative measure of major groove compressibility or anisotropic bendability. These frequencies allow for the derivation of bendability parameters for the 32 complementary trinucleotide pairs. Large B values correspond to flexibility.

(5) The trinucleotide position preference (PP) scale, derived from experimental investigations of the positioning of DNA in nucleosomes. It has been found that certain trinucleotides have a strong preference for being positioned in phase with the helical repeat. Depending on the exact rotational position, such triplets will have minor grooves facing either towards or away from the nucleosome core (Satchwell et al., 1986).
Based on the premise that flexible sequences can occupy any rotational position on nucleosomal DNA, these preference values can be used as a triplet scale that measures DNA flexibility. Hence, in this model, all triplets with close to zero preference are assumed to be flexible, while triplets with preference for facing either in or out are taken to be more rigid and have larger PP values. Note that we do not use this scale as a measure of how well different triplets form nucleosomal DNA. Instead, the absolute value, or unsigned nucleosome positioning preference, is used here, as in Pedersen et al. (1998), as a measure of DNA flexibility.

For completeness, all these scales are displayed in the appendix. In previous studies, we found these models useful (Baldi et al., 1996; Pedersen et al., 2000; Baldi et al., 1999), in particular for the detection of putative new structural signatures associated with an increase of bendability in downstream regions of RNA polymerase II promoters. A similar approach (Liao et al., 2000) was used to analyze the structure of insertion sites for P transposable elements in Drosophila melanogaster and suggest that the corresponding transposition mechanism recognizes a structural signature rather than a specific sequence motif. With the exception of BS, all the models were determined by purely experimental observations of sequence-structure correlations. Additional scales capturing DNA properties related directly or indirectly to structure, such as enthalpy or melting temperature, have also been proposed (Breslauer et al., 1986; Ponomarenko et al., 1999).

The primary focus of this work is not on assessing the merits and pitfalls of each model, but rather on the development of general methods for the systematic application of any scale to any sequence of any length, up to entire genomes, under the assumption that the scale can be used additively within a sliding window. In general, this assumption will provide a reasonable approximation, at least up to a certain length to be determined experimentally. In particular, we are interested in the development of methods for the automatic recognition of structural motifs associated with extremal features, such as extreme stiffness or bendability. The calibration of corresponding thresholds is expected to be useful in database searches and is conceptually similar, for instance, to the calibration of thresholds for detecting sequence homology. More generally, however, database searches may also be conducted on the basis of structural signatures or profiles that need not be extremal and could be obtained from reasonable training sets. Certain protein binding sites, for instance, are highly degenerate at the DNA sequence level, with low sequence homology, while exhibiting at the same time a high degree of DNA structural similarity. Similarly, periodic flexible triplets in phase with the double helical pitch are necessary to ensure long range curvature, for instance in nucleosome regions.

Although several scales may agree on some structural features, the fact remains that they may also display divergent interpretations of some sequence elements. While no final consensus regarding these models exists, it is likely that each one provides a slightly different and partially complementary view of DNA structure. Thus a second goal of this work is the comparison of the models in the limited sense of estimating the statistical correlation between different scales. In Baldi et al. (1998) it was shown that by and large many of the commonly used scales exhibit low correlations measured at the level of single di- or tri-nucleotides. Empirical measurements of correlations between the scales over longer lengths in Escherichia coli have recently revealed different unexplained patterns (Pedersen et al., 2000). Here we provide a complete explanation of these phenomena and show how correlations vary with background distribution and with window length. Finally, while the methods introduced can be applied to any DNA sequence, we focus here on a particularly important class of DNA sequences, namely DNA tandem repeats, where the general framework is further specialized.

DNA repeats

Genomes, especially eukaryotic genomes, are replete with DNA repetitive regions (Jurka et al., 1992; Jeffreys, 1997; Jurka, 1998). Well over 3% of the human genome has been estimated to comprise repetitive DNA of some sort (Benson and Waterman, 1994), the exact function of which is often unknown. Such DNA arises through many different evolutionary and genetic mechanisms. Over 95 different classes (Jurka, 1998) of repeats have been catalogued. Two major groups of repeats exist: interspersed repeats and tandem repeats. While the methods to be developed can be applied to both groups, our analysis will focus on tandem repeats, consisting of two or more contiguous copies of a particular pattern of nucleotides. Tandem repeats may cover up to % of the human genome. Tandem repeats vary widely, over several orders of magnitude, both in terms of the length of the repeating pattern and the number of more or less exact contiguous copies. Repeats are often polymorphic and therefore play a major role in linkage studies and DNA fingerprinting. In many cases, the genetic origin, the structure, and the function of these repetitive regions are poorly understood. There exist a few examples, however, where the repeats are known to play a biological role in both healthy and pathological conditions.
Certain tandem repeats, for instance, have been associated with protein binding sites or interactions with transcription factors. An important advance in epigenetics research has been the realization that interactions between repeated DNA sequences can trigger the formation and the transmission of inactive genetic states and DNA modifications (Wolffe and Matzke, 1999). In several of these cases, the particular DNA-helical structural features of the repeat sequences seem to play an essential role.

Interest in tandem repeats has been heightened over the last few years by the discovery that several important degenerative disorders, including Huntington's disease, myotonic dystrophy, fragile X syndrome, and several forms of ataxia, result from the abnormal expansion of particular DNA triplets (The Huntington's Disease Collaborative Research Group, 1993; Ashley and Warren, 1995; Ross, 1995; Gusella and MacDonald, 1996; Hardy and Gwinn-Hardy, 1998; Rubinsztein and Hayden, 1998; Baldi et al., 1999). The exact mechanism by which a triplet repeat mutation causes disease varies, as indicated by the fact that currently known repeat expansions are found in 5′ UTRs, in 3′ UTRs, in introns, and within coding sequences of various affected genes (Ashley and Warren, 1995; Gusella and MacDonald, 1996; Rubinsztein and Amos, 1998; Rubinsztein and Hayden, 1998). For instance, fragile X mental retardation is associated with an expanded CGG repeat in the 5′ UTR of the FMR1 gene (Nelson, 1995; Eichler and Nelson, 1998).

The 64 possible triplets can be clustered into 12 equivalence classes when shift and reverse complement operations are considered (see below). Currently only three repeat classes, CAG, CGG, and GAA, out of the possible twelve, are associated with triplet repeat disorders. There is evidence that unusual structural features of the repeats play a role in their expansion (Wells, 1996; Pearson and Sinden, 1998a,b; Moore et al., 1999). In Baldi et al. (1999), the structural scales above were used to show that the triplet classes involved in the diseases have extreme structural characteristics of very high or very low flexibility. Methods to quantify the degree of extremality relative to other sequences, however, were not developed. Furthermore, other triplet or non-triplet repeats may play a role in diseases as well as other biological processes. Therefore the techniques need to be improved and extended to all classes of repeats.

Hence, given the importance of repeating patterns and the exponential growth of sequence databases, our goal is also to develop new tools for the computational analysis of the structural properties of arbitrary repeats and begin to apply such techniques in a systematic and quantifiable way. Various algorithms for searching tandem repeats have been developed (Milosavljevic and Jurka, 1993; Benson and Waterman, 1994; Benson, 1999; Blanchard et al., 2000). The techniques presented here can also be viewed as complementing such algorithms by introducing a structural perspective.

Organization

The remainder of the paper is organized as follows. In the next section we develop a general framework for the analysis of the score of a sequence (repetitive or not) under any additive scale. We determine the number of different sequence equivalence classes under circular permutation and reverse complement operations. We show how to determine and visualize maximal and minimal patterns and study the statistical properties of the scales, including intra-scale (mean and variance) and inter-scale (correlations) statistics for sequences of various lengths, as well as asymptotic normality. This framework is essential in order to compare the behavior of various scales, to locate a given sequence with respect to a comparable population, and to automatically set thresholds in database searches.
We then apply the general framework to the five structural models described above and various tandem repeats.

Methods and theory

General framework

The general framework we consider begins with an alphabet $\mathcal{A}$ of size $A$ and a scale $\mathcal{S}$ of length (or size) $S$. The scale is a function that assigns a value to any $S$-tuple of the alphabet, for instance in the form of a table with $A^S$ entries. In the Results section, we deal exclusively with the nucleotide DNA alphabet ($A = 4$) and with DNA structural scales, such as dinucleotide scales with $S = 2$ (e.g. propeller twist) or trinucleotide scales with $S = 3$ (e.g. bendability). The same framework, however, can readily be applied to other situations (e.g. amino acid alphabet with hydrophobicity scales).

Given a primary sequence $s = X_1 X_2 \ldots X_N$ of length $N \ge S$ over $\mathcal{A}$, we assume that the scale $\mathcal{S}$ is approximately additive in the sense that the corresponding global property of the sequence $s$ can be estimated by sliding the scale along the sequence in the form

$$\mathcal{S}(s) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots = \sum_{i=1}^{N-S+1} \mathcal{S}(X_i \ldots X_{i+S-1}). \qquad (1)$$

In practical applications, such a quantity can also be averaged over a window of length $W$ to get, for instance, a more homogeneous per base-pair value ($W \le N$). This averaging process does not concern us at this stage since it merely amounts to using a different scale, with a larger size. The form given in equation (1) corresponds to a free boundary condition. The ideas to be developed can be applied to other boundary conditions, including periodic boundary conditions, where the sequence is wrapped around, as described below. With the proper modifications, the theory applies immediately to the case where the scales are shifted by more than one position at each step.

Consider now a repeat sequence $r$ consisting of a unit pattern or period $p = (X_1 \ldots X_P)$ of length $P$, and repetition number $R > 1$, so that $r = (X_1 \ldots X_P)^R$ with $N = PR \ge S$.
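The additive score defined above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `additive_score` and the table `TOY_SCALE` are assumptions introduced here, and the toy values (number of A/T letters in a dinucleotide) are not one of the published structural scales.

```python
from itertools import product

def additive_score(seq, scale, size):
    """Sum the scale over every window of length `size` (free boundary condition)."""
    if len(seq) < size:
        raise ValueError("sequence shorter than the scale")
    return sum(scale[seq[i:i + size]] for i in range(len(seq) - size + 1))

# Hypothetical toy dinucleotide scale (NOT a published scale):
# the value of a dinucleotide is the number of A/T letters it contains.
TOY_SCALE = {a + b: float((a in "AT") + (b in "AT"))
             for a, b in product("ACGT", repeat=2)}

# The windows of ATGC are AT, TG and GC, with toy values 2, 1 and 0.
score = additive_score("ATGC", TOY_SCALE, 2)
```

A per-base-pair value can be obtained by dividing the score by the number of windows, `len(seq) - size + 1`.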
Notice that the period is not uniquely determined since, for instance, XXXX can be viewed as $(X)^4$, or as $(XX)^2$. In addition, we will assume that $P + S - 1 \le N$, or equivalently that $S \le (R-1)P + 1$, so that the scale $\mathcal{S}$ is applied starting at least once from each letter in the repetitive unit, without exceeding the repetitive sequence boundary. In this case, $\mathcal{S}(r)$ has the form:

$$\mathcal{S}(r) = l\,\mathcal{S}(p) + \epsilon \qquad (2)$$

where $\mathcal{S}(p)$ is the contribution of the periodic unit

$$\mathcal{S}(p) = \mathcal{S}(X_1 \ldots X_S) + \mathcal{S}(X_2 \ldots X_{S+1}) + \cdots + \mathcal{S}(X_P X_1 \ldots X_{S-1}) = \sum_{i=1}^{P} \mathcal{S}(X_i \ldots X_{i+S-1}) \quad [\mathrm{mod}\ P]. \qquad (3)$$

The number $l$ of times the periodic unit is covered by $\mathcal{S}$ and its shifted versions is given by:

$$l = \left\lfloor \frac{PR - S + 1}{P} \right\rfloor = R - \left\lceil \frac{S - 1}{P} \right\rceil. \qquad (4)$$

Finally, if $lP + S - 1 = RP$ then the boundary tail $\epsilon$ is equal to 0. Otherwise

$$\epsilon = \mathcal{S}(X_{lP+1} \ldots X_{lP+S}) + \cdots + \mathcal{S}(X_{RP-S+1} \ldots X_{RP}) \qquad (5)$$

where indices can be taken modulo $P$, i.e. $X_{lP+1} = X_1$ and so forth. The sum in equation (5) has at most $P - 1$ terms. In practice, at least in the case of DNA, only short scales are currently available and therefore in most cases $S \le P + 1$. In this case, equation (2) simplifies to:

$$\mathcal{S}(r) = (R - 1)\,\mathcal{S}(p) + \epsilon. \qquad (6)$$

Equivalence classes

In the special case of repetitive sequences, we also need to be able to count the number of different repeats with respect to a given scale. It is often the case that the scale $\mathcal{S}$ is characterized by some kind of invariance with respect to the sequences of length $S$ of $\mathcal{A}$. In the case of DNA, the structural scales we have are invariant with respect to the reverse complement. When looking at repeat sequences, this determines how many different repeat patterns of length $P$ need to be considered. A triplet repeat, for instance, can be described in terms of different unit trinucleotides depending on what strand and triplet frame is chosen. Thus, the repeat CAGCAGCAG... can be said to be a repeat of the triplet CAG, and also of its reverse complement CTG. Ignoring repeat boundaries, however, the sequence can also be described as a repeat of the shifted triplet pairs AGC/GCT and GCA/TGC. In this way, the 64 different trinucleotides can be divided into 12 possible repeat classes. Of these 12 classes, only 10 are proper triplet repeat classes in the sense that they do not result from a repeat pattern of shorter length. The two classes associated with shorter patterns are obviously the triplet pairs AAA/TTT and CCC/GGG, which are more precisely described as mononucleotide repeats. [For a generic alphabet $\mathcal{A}$, a reverse complement operation can be defined by introducing a one-to-one function $X \to \bar{X}$ from the alphabet to itself, satisfying $\bar{\bar{X}} = X$, so that the reverse complement of $X_1 \ldots X_N$ is defined to be $\bar{X}_N \ldots \bar{X}_1$.]
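The equivalence class of a repeat unit under circular shifts and reverse complement, as described above for CAG, can be enumerated directly. The sketch below is illustrative only; the helper names `reverse_complement` and `repeat_class` are assumptions introduced here.

```python
# Watson-Crick complement table for the DNA alphabet.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(unit):
    """Complement every letter, then reverse the string."""
    return unit.translate(COMPLEMENT)[::-1]

def repeat_class(unit):
    """All units equivalent to `unit` under circular shift and reverse complement."""
    rotations = {unit[i:] + unit[:i] for i in range(len(unit))}
    return rotations | {reverse_complement(u) for u in rotations}

cag_class = repeat_class("CAG")    # CAG, AGC, GCA and their reverse complements
atat_class = repeat_class("ATAT")  # a class smaller than 2P because of sub-periodicity
```

For CAG this yields the six triplets named in the text (CAG/CTG, AGC/GCT, GCA/TGC); for ATAT the class has only two members, ATAT and TATA.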
In the case of a DNA repeat with unit repeat length $P$, the number of classes and the number of elements in each equivalence class are dictated by the action of the group of transformations associated with the circular permutations and the reverse complement operations on the set of all possible strings of length $P$. AAA.../TTT... and CCC.../GGG... always give rise to two separate classes with two elements each. In general, a typical class will contain $2P$ elements associated with the $P$ permutations and the $P$ reverse complements. Classes containing fewer elements, however, can arise, for instance as a result of sub-periodicity effects when $P$ is not prime, and of identical reverse complement effects. For instance, when $P = 4$, the class of ATAT contains only two elements since it is identical to its reverse complement and can be shifted circularly only once before returning to the original pattern.

The number of classes can be counted using standard group theory arguments detailed in the appendix. These arguments are not restricted to circular permutation and reverse complement operations, but apply to any group of transformations over any sequences. The number of classes, when only circular permutations without reverse complement are taken into account, is given by

$$\frac{1}{P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} = \frac{1}{P} \sum_{1 \le k \le P} A^{(P,k)} \qquad (7)$$

where $(P, k)$ is the greatest common divisor (gcd) of $P$ and $k$, and $\phi(n)$ is the Euler function counting the number of integers less than $n$ which are prime to $n$, i.e. without common divisors with $n$. If $p_1, \ldots, p_k$ is the list of distinct prime factors of $n$, then the Euler function can be expressed as:

$$\phi(n) = n \prod_{i=1}^{k} \left(1 - \frac{1}{p_i}\right). \qquad (8)$$

When both circular permutations and reverse complement are taken into account, the number of classes for odd $P$ is given by

$$\frac{1}{2P} \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} = \frac{1}{2P} \sum_{1 \le k \le P} A^{(P,k)}. \qquad (9)$$

When $P$ is even, the corresponding number of classes is

$$\frac{1}{2P} \left[ \sum_{d \mid P} \phi\!\left(\frac{P}{d}\right) A^{d} + \frac{P}{2} A^{P/2} \right] \qquad (10)$$

or, equivalently,

$$\frac{1}{2P} \left[ \sum_{1 \le k \le P} A^{(P,k)} + \frac{P}{2} A^{P/2} \right]. \qquad (11)$$

In particular, when $P$ is prime (and odd, so that equation (9) applies), the number of different classes under periodic and reverse complement equivalence is

$$\frac{1}{2P} \left[ (P - 1) A + A^{P} \right]. \qquad (12)$$

The number of classes which are new at a given length $P$, i.e. that do not result from the repetition of a shorter pattern of length dividing $P$, can easily be obtained by subtracting the corresponding counts for each divisor of $P$. When $P$ is prime, all classes are new except for the classes resulting from mono-letter repeats. Table 1 in the Results section exemplifies the application of equations (9)-(12).
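The counting formulas above can be checked against a brute-force enumeration for small unit lengths. The sketch below is an illustration under stated assumptions: it is restricted to the DNA alphabet, where no letter is its own complement, and the names `canonical`, `count_classes_brute` and `count_classes_formula` are introduced here, not taken from the paper.

```python
from itertools import product
from math import gcd

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def canonical(unit):
    """Smallest string among all circular shifts of `unit` and of its reverse complement."""
    rc = unit.translate(COMPLEMENT)[::-1]
    return min(u[i:] + u[:i] for u in (unit, rc) for i in range(len(unit)))

def count_classes_brute(P):
    """Enumerate all 4**P units and count distinct canonical representatives."""
    return len({canonical("".join(t)) for t in product("ACGT", repeat=P)})

def count_classes_formula(P, A=4):
    """Closed form: gcd sum over circular shifts, plus the even-P correction term."""
    rotations = sum(A ** gcd(P, k) for k in range(1, P + 1))
    reflections = (P // 2) * A ** (P // 2) if P % 2 == 0 else 0
    return (rotations + reflections) // (2 * P)

# Compare the two counts for short units (4**6 = 4096 strings at most).
agreement = {P: count_classes_brute(P) == count_classes_formula(P) for P in range(1, 7)}
```

The brute-force count grows as $4^P$ and is only meant as a sanity check; the closed form is what one would use for longer units.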

Extremal sequences and automata

We are interested in the construction and recognition of sequences $s$ that are extremal for $\mathcal{S}$, i.e. such that $\mathcal{S}(s)$ is very large or very small relative to the other sequences of length $N$. For this, we attach to each scale a prefix automaton, or prefix graph. The prefix automaton can be described by a directed graph containing $A^{S-1}$ nodes, each labeled by a string of length $S - 1$ over $\mathcal{A}$ of the form $X_1 \ldots X_{S-1}$ (see Figure 1 for an example). Each node has $A$ directed outgoing connections. $X_1 \ldots X_{S-1}$ is connected to $X_2 \ldots X_{S-1} Y$, for each letter $Y$ in $\mathcal{A}$, hence the notion of prefix. The weight (or length) of the corresponding transition is provided by the entry associated with $X_1 X_2 \ldots X_{S-1} Y$ in the structural table. The $A$ nodes labeled $(X)^{S-1} = XXX \ldots X$ (monorepeats) are the only ones to have a self-connection. Any sequence $s$ of length $N$ is trivially associated with the path:

$$X_1 \ldots X_{S-1} \to X_2 \ldots X_S \to \cdots \to X_{N-S+2} \ldots X_N.$$

The value of $\mathcal{S}(s)$ is found just by adding the weights of the corresponding connections. As a result, sequences associated with maximal or minimal values of $\mathcal{S}(s)$ correspond to paths in the prefix graph with maximal or minimal total weight or length. These can easily be found by standard dynamic programming techniques, which can also be extended to finding, for instance, the $k$ longest or shortest paths.

A repeat pattern of length $P$ is a directed cycle in the prefix automaton graph. Notice that any path of length greater than $A^{S-1}$ must intersect itself at least once. Thus any cycle of length strictly greater than $A^{S-1}$ must be composed of non-intersecting cycles of length at most $A^{S-1}$. For instance, with a dinucleotide scale, any repeat unit of length greater than four must contain at least two cycles of length at most four. Therefore in the study of repeats, we need only study the properties of all non-intersecting directed cycles of length up to $A^{S-1}$ together with all possible ways of joining them.
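The dynamic programming search over the prefix graph mentioned above can be sketched as follows. This is one possible formulation, not the authors' program: `extremal_sequence` and `TOY_SCALE` are hypothetical names, and the toy table holds arbitrary illustrative values rather than a published scale.

```python
from itertools import product

def extremal_sequence(scale, size, length, alphabet="ACGT", maximize=True):
    """Best additive score and one sequence achieving it, by dynamic programming.

    Nodes of the prefix graph are the (size - 1)-letter strings; following the
    edge labelled y from `node` adds scale[node + y] to the running score.
    """
    if length < size:
        raise ValueError("length must be at least the scale size")
    sign = 1 if maximize else -1
    # best[node] = (score, sequence) of the best path currently ending at `node`
    best = {"".join(t): (0, "".join(t)) for t in product(alphabet, repeat=size - 1)}
    for _ in range(length - size + 1):
        step = {}
        for node, (score, seq) in best.items():
            for y in alphabet:
                word = node + y
                target = word[1:]  # node reached after reading letter y
                candidate = (score + scale[word], seq + y)
                if target not in step or sign * candidate[0] > sign * step[target][0]:
                    step[target] = candidate
        best = step
    return max(best.values(), key=lambda item: sign * item[0])

# Hypothetical toy dinucleotide table (illustrative values only, not a published scale).
LETTERS = "ACGT"
TOY_SCALE = {a + b: (3 * i + 5 * j) % 7 - 3
             for i, a in enumerate(LETTERS) for j, b in enumerate(LETTERS)}

top_score, top_seq = extremal_sequence(TOY_SCALE, 2, 8)
low_score, low_seq = extremal_sequence(TOY_SCALE, 2, 8, maximize=False)
```

The sketch keeps one full sequence per node for readability; storing back-pointers instead would reduce memory for long sequences.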
In addition to dynamic programming techniques, it is also useful to tabulate the weights of all possible short cycles for at least two reasons. First, because longer patterns are built from shorter cycles. Second, at least in the case of DNA, many important existing repeats, such as triplet repeats, are based on a short repeating pattern. While the prefix graph is useful for constructing extremal sequences and recognizing them as long as $A$, $S$ and $N$ are small, it is also necessary to develop more general techniques by which we can rapidly assess, for any sequence $s$, the magnitude of $\mathcal{S}(s)$ with respect to all the other comparable sequences. This is best achieved by viewing the sequences in a probabilistic context.

Probabilistic modeling

Consider now that sequences are being generated by a random process. In order to fix the ideas, we take for simplicity a Markov model of order 0, i.e. we assume that sequences are generated by $N$ tosses of the same die with distribution $D = (p_X)$ over the alphabet $\mathcal{A}$. The same analysis, however, can easily be extended to other probabilistic models such as higher-order Markov models where distributions are defined, for instance, on pairs or triplets of letters. From equation (1), $\mathcal{S}(s)$ is now a random variable which is the sum of $N - S + 1$ random variables: $\mathcal{S}(s) = Y_1 + \cdots + Y_{N-S+1}$. By construction, all the variables $Y_i = \mathcal{S}(X_i \ldots X_{i+S-1})$ have the same distribution, but they are not independent. Rather they satisfy a form of local dependence, called m-dependence in statistics. More precisely, for $i < j$, $Y_i$ and $Y_j$ are independent if and only if $j - i \ge S$. Using the linearity of the expectation, we have:

$$E(\mathcal{S}(s)) = (N - S + 1)\,E(Y_i) \approx N \alpha_S \qquad (13)$$

with

$$E(Y_i) = \sum_{X_1 \ldots X_S} \mathcal{S}(X_1 \ldots X_S)\, p(X_1) \cdots p(X_S) = \alpha_S \qquad (14)$$

the sum being over all $A^S$ $S$-tuples of the alphabet. To situate an individual sequence with respect to the entire population, we need to calculate the variance.
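The expectation of the score under the order-0 background can be computed by direct enumeration of the scale table. The sketch below is illustrative: `scale_mean`, `expected_score`, `TOY_SCALE` and `UNIFORM` are names assumed here, and the toy scale (A/T count of a dinucleotide) is not a published scale.

```python
from itertools import product
from math import prod

def scale_mean(scale, size, probs):
    """alpha_S = E(Y_i): scale values weighted by i.i.d. letter probabilities."""
    return sum(scale["".join(w)] * prod(probs[x] for x in w)
               for w in product(probs, repeat=size))

def expected_score(scale, size, probs, length):
    """E(S(s)) = (N - S + 1) * alpha_S for a sequence of the given length."""
    return (length - size + 1) * scale_mean(scale, size, probs)

# Hypothetical toy dinucleotide scale: number of A/T letters in the window.
TOY_SCALE = {a + b: float((a in "AT") + (b in "AT"))
             for a, b in product("ACGT", repeat=2)}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

alpha = scale_mean(TOY_SCALE, 2, UNIFORM)
mean_n10 = expected_score(TOY_SCALE, 2, UNIFORM, 10)
```

With the toy scale and a uniform background, each letter is A or T with probability 0.5, so the mean per window is 1 and a sequence of length 10 (nine windows) has expected score 9.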
The variance can also be calculated explicitly by taking advantage of the local dependence of the variables $Y_i$. We have

$$\operatorname{Var}(\mathcal{S}(s)) = (N - S + 1)\operatorname{Var}(Y_i) + 2 \sum_{0 < j - i < S} \operatorname{Cov}(Y_i, Y_j) \qquad (15)$$

with the covariances $\operatorname{Cov}(Y_i, Y_j) = E[(Y_i - E(Y_i))(Y_j - E(Y_j))]$. As soon as $j - i \ge S$, $Y_i$ and $Y_j$ are independent and the corresponding covariance is 0. Thus, for any given scale $\mathcal{S}$, one needs only to tabulate the expectation $E(Y_i)$ and the $S$ relevant short-range covariances

$$C_k = C_k(\mathcal{S}) = \operatorname{Cov}(Y_i, Y_{i+k}) \qquad (16)$$

for $0 \le k < S$ ($C_0 = \operatorname{Var}(Y_i)$). Alternatively, by factoring out the variance of $Y_i$, equation (15) can also be expressed in terms of the correlations

$$\operatorname{Var}(\mathcal{S}(s)) = \operatorname{Var}(Y_i) \left[ N - S + 1 + 2 \sum_{0 < j - i < S} \operatorname{Cor}(Y_i, Y_j) \right]. \qquad (17)$$

To obtain the exact variance at each length $N$, it is then only a matter of counting how many times each type of covariance is present in the sequences and adjusting for any boundary effects as needed.
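The counting argument above can be sketched numerically: tabulate the short-range covariances by enumeration, then weight each lag by the number of window pairs at that lag. This is an illustrative sketch under an i.i.d. background; the function names and the toy scale (A/T count of a dinucleotide) are assumptions introduced here.

```python
from itertools import product
from math import prod

def word_prob(word, probs):
    """Probability of a word under an i.i.d. (order-0) background."""
    return prod(probs[x] for x in word)

def covariances(scale, size, probs):
    """C_k = Cov(Y_i, Y_{i+k}) for 0 <= k < S, by enumerating words of length S + k."""
    alpha = sum(scale["".join(w)] * word_prob(w, probs)
                for w in product(probs, repeat=size))
    cov = []
    for k in range(size):
        joint = sum(scale["".join(w[:size])] * scale["".join(w[k:])] * word_prob(w, probs)
                    for w in product(probs, repeat=size + k))
        cov.append(joint - alpha ** 2)
    return cov

def score_variance(scale, size, probs, length):
    """Exact Var(S(s)): there are (n_terms - k) pairs of windows at lag k."""
    c = covariances(scale, size, probs)
    n_terms = length - size + 1
    return n_terms * c[0] + 2 * sum((n_terms - k) * c[k]
                                    for k in range(1, min(size, n_terms)))

# Hypothetical toy dinucleotide scale: number of A/T letters in the window.
TOY_SCALE = {a + b: float((a in "AT") + (b in "AT"))
             for a, b in product("ACGT", repeat=2)}
UNIFORM = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

var_n4 = score_variance(TOY_SCALE, 2, UNIFORM, 4)
```

For the toy scale with a uniform background, $C_0 = 0.5$ and $C_1 = 0.25$, so a sequence of length 4 (three windows) has variance $3 \times 0.5 + 2 \times 2 \times 0.25 = 2.5$.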

If $N \ge 2S$, then

$$\operatorname{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2 \sum_{k=1}^{S-1} (N - S - k + 1)\,C_k. \qquad (18)$$

If $S \le N < 2S$, then

$$\operatorname{Var}(\mathcal{S}(s)) = (N - S + 1)\,C_0 + 2 \sum_{k=1}^{N-S} (N - S - k + 1)\,C_k. \qquad (19)$$

It is worth noticing that, for fixed $S$, both the expectation and the variance are linear in $N$. In particular, for large $N$

$$\operatorname{Var}(\mathcal{S}(s)) \approx N \left[ C_0 + 2 \sum_{k=1}^{S-1} C_k \right] = N \left[ \sum_{k=-S+1}^{S-1} C_k \right] = N \beta_S. \qquad (20)$$

In the last equality, for obvious symmetry reasons, we let $C_{-k} = C_k$. This notation will prove to be useful below.

In the case of repetitive sequences, it is also useful to calculate the expectation of $\mathcal{S}(p) = Y_1 + \cdots + Y_P$, and its variance with periodic boundary conditions modulo $P$, i.e. assuming the variables $Y_1 \ldots Y_P$ and the corresponding letters are arranged along a circle. Here both the expectation and the variance are directly proportional to $P$ and satisfy $E(\mathcal{S}(p)) = \alpha_S P$ and $\operatorname{Var}(\mathcal{S}(p)) = \beta_S P$. Clearly, for any $P$, $E(\mathcal{S}(p)) = P\,E(Y_i)$ so

$$\alpha_S = E(Y_i). \qquad (21)$$

If $P \ge 2S$,

$$\beta_S = C_0 + 2 \sum_{k=1}^{S-1} C_k = \sum_{k=-S+1}^{S-1} C_k. \qquad (22)$$

When $S \le P < 2S$, all variables along the circle are dependent and therefore $\beta_S = \beta_S(P)$ is given by

$$\beta_S(P) = C_0 + 2 \sum_{k=1}^{n} C_k = \sum_{k=-n}^{n} C_k \qquad (23)$$

when $P = 2n + 1$, and

$$\beta_S(P) = C_0 + C_n + 2 \sum_{k=1}^{n-1} C_k = C_n + \sum_{k=-n+1}^{n-1} C_k \qquad (24)$$

when $P = 2n$. Periodic boundary conditions must be used in the computation of the covariances $C_k$ whenever necessary ($|k| > P - S$). For a periodic sequence $r$, where the period $P$ as well as $S$ are small relative to the length $N = RP$, we can use:

$$E(\mathcal{S}(r)) \approx R\,E(\mathcal{S}(p)) \qquad (25)$$

and the approximation

$$\operatorname{Var}(\mathcal{S}(r)) \approx R \operatorname{Var}(\mathcal{S}(p)). \qquad (26)$$

For long repetitive sequences with period $P < S$, we can use the same approach with a larger period $P'$, multiple of $P$, so that $S \le P'$.

Central limit theorem. $\mathcal{S}(s)$ consists of a sum of identically distributed but non-independent random variables. Therefore standard central limit theorems for sums of independent random variables cannot be applied. Yet, because the dependencies are local, a sum $Z = Y_1 + \cdots + Y_K$ of $K$ m-dependent random variables $Y_i$ still approaches a normal distribution.
This can be shown using the theorem in Baldi and Rinott (1989), which also provides a bound on the rate of convergence. Here we use the improved bound found in Rinott and Dembo (1996). We let $\max_i |Y_i - E(Y_i)| = B$, and $E\big(\sum_{i=1}^{K} |Y_i - E(Y_i)|\big)/K = \mu$. For all the scales to be considered, these constants are well defined and easy to compute. Under these assumptions,

$$\left| P\!\left(\frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \le u\right) - \Phi(u) \right| \le \frac{7K\mu}{[\mathrm{Var}(Z)]^{3/2}}\,(2S-1)^2 B^2 \tag{27}$$

where $\Phi(u)$ is the normalized Gaussian distribution. The factor $(2S-1)$ represents the size of the clusters associated with m-dependence. For a fixed scale, such size is constant, but the theorem remains true if S grows slowly with N. Thus equation (27) can readily be applied to S(s) or S(r) with $K = N - S + 1$ or $K = RP$. From equation (20), the variance of the sequences being considered is linear in their length: $\mathrm{Var}(S(s)) \approx \beta N$, where $\beta$ depends only on the scale S. Thus we obtain a convergence rate that scales at most like $1/\sqrt{N}$:

$$\left| P\!\left(\frac{Z - EZ}{\sqrt{\mathrm{Var}(Z)}} \le u\right) - \Phi(u) \right| \le \frac{C}{\sqrt{N}} \tag{28}$$

with $C \approx 7\mu(2S-1)^2 B^2 \beta^{-3/2}$. The rate of this bound is known to be essentially optimal (similar to the Berry-Esseen theorems, Feller, 1971).

Normalized distances and extremal sequences. The value of S(s) or S(r) of any sequence or repeat of length N can be compared to the average value of a background population by computing a normalized Z-score of the form:

$$Z(s) = \frac{S(s) - \alpha_S N}{\sqrt{\beta_S N}}. \tag{29}$$

A repeat r with period unit length P and repetition R ($N = RP$) can be compared to a background population of repeats, or to a background population of generic sequences. In the latter case, we have $S(r) \approx S(p)R$ or $S(r) \approx \alpha' N$, with $\alpha' = S(p)/P$. Therefore the Z-score

$$Z(r) \approx \frac{(\alpha' - \alpha)\sqrt{N}}{\sqrt{\beta}} \tag{30}$$

grows with N and is larger than the Z-score Z(p) computed on the repeat unit. In other words, if a repeat unit displays extremal features when compared to other repeat units of the same length, its expansion will appear even more extreme compared to the background of all sequences of similar length. The Z-scores can be used to assess how extreme a sequence is and to search databases for subsequences with extremal features. As in the case of alignments, this can also be done using extreme value distributions (Durbin et al., 1998). Note also that one can search a database using a structural profile rather than extreme values. The degree of similarity between two profiles can be measured, for instance, using the standard mean square error.

Correlations between scales. It is useful to have some information regarding the degree of correlation between two scales and how such correlation behaves at all sequence lengths. Consider then two scales $S_1$ and $S_2$ of lengths $S_1$ and $S_2$. Without any loss of generality assume that $S_1 \le S_2$. For sequences s of length N, we are interested in measuring the correlation between the random variables $S_1(s) = Y_1 + \cdots + Y_{N-S_1+1}$, with $Y_i = S_1(X_i \ldots X_{i+S_1-1})$, and $S_2(s) = Z_1 + \cdots + Z_{N-S_2+1}$, with $Z_i = S_2(X_i \ldots X_{i+S_2-1})$. We have:

$$\mathrm{Cor}(S_1(s), S_2(s)) = \frac{\mathrm{Cov}(S_1(s), S_2(s))}{\sqrt{\mathrm{Var}(S_1(s))\,\mathrm{Var}(S_2(s))}}. \tag{31}$$

Again only terms of the form $\mathrm{Cov}(Y_i, Z_j)$, where the distance between i and j is small, are non-zero. More precisely, non-zero terms can arise only if $0 \le j - i < S_1$ or $0 \le i - j < S_2$. It is sufficient to tabulate the finite set of $S_1 + S_2 - 1$ covariances

$$C_k = C_k(S_1, S_2) = \mathrm{Cov}(Y_i, Z_{i+k}) = E[(Y_i - E(Y_i))(Z_{i+k} - E(Z_{i+k}))] \tag{32}$$

with $S_1 \le S_2$ and $-S_2 + 1 \le k \le S_1 - 1$. These covariances can be used to compute correlations at all lengths by writing

$$\mathrm{Cov}(S_1(s), S_2(s)) = (N - S_2 + 1)\,\mathrm{Cov}(Y_i, Z_i) + \sum_{i \ne j} \mathrm{Cov}(Y_i, Z_j). \tag{33}$$

For large N it is clear that, except for small boundary effects, each type of covariance occurs approximately N times in the formula above. Therefore for large N, $\mathrm{Cov}(S_1(s), S_2(s))$ behaves approximately as

$$N\Big[C_0 + \sum_{k=-S_2+1,\,k \ne 0}^{S_1-1} C_k\Big] = N\Big[\sum_{k=-S_2+1}^{S_1-1} C_k\Big]. \tag{34}$$

We have seen in equation (20) that the variance of each scale is also asymptotically linear in the length N. Thus, as N increases, the correlation $\mathrm{Cor}(S_1(s), S_2(s))$ rapidly converges to a constant given by:

$$\frac{\sum_{k=-S_2+1}^{S_1-1} C_k(S_1, S_2)}{\Big[\Big(\sum_{k=-S_1+1}^{S_1-1} C_k(S_1)\Big)\Big(\sum_{k=-S_2+1}^{S_2-1} C_k(S_2)\Big)\Big]^{1/2}}. \tag{35}$$

In checking calculations on DNA scales (or other alphabets) that are invariant under the reverse complement operation, it is worth noticing that with a uniform distribution on the alphabet ($p_A = p_C = p_G = p_T = 0.25$), the correlations are symmetric. That is, for any $0 < k < S_1$ we have $C_k(S_1, S_2) = C_{-k}(S_1, S_2)$. This results immediately from the fact that the sum of the terms $S_2(X_1 \ldots X_{S_2})\,S_1(X_1 \ldots X_{S_1})$ and $S_2(\bar{X}_{S_2} \ldots \bar{X}_1)\,S_1(\bar{X}_{S_2} \ldots \bar{X}_{S_2-S_1+1})$ is equal to the sum of the terms $S_2(X_1 \ldots X_{S_2})\,S_1(X_{S_2-S_1+1} \ldots X_{S_2})$ and $S_2(\bar{X}_{S_2} \ldots \bar{X}_1)\,S_1(\bar{X}_{S_1} \ldots \bar{X}_1)$, and similarly for other degrees of overlap. The terms in the sums can be identically paired using the fact that $S_1$ and $S_2$ are assumed to be reverse-complement invariant. The result is not true if the scales, or the distribution, are not reverse-complement invariant.

Results

DNA repeat equivalence classes

We wrote a program that cycles through all possible DNA sequences of length P, counting and listing all the classes that are equivalent under circular permutation and reverse complement operations. Because of this equivalence, in the case of scales that are reverse-complement invariant, it is sufficient to study the repeats of one representative member of each class. We ran the program up to length P = 12. The results, shown in Table 1, are in complete agreement with equations (9)–(12).
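The enumeration described above can be sketched in a few lines. This is not the authors' program, only a minimal brute-force illustration under the stated equivalence (circular permutation and reverse complement). The function names `canonical`, `equivalence_classes` and `size_histogram` are hypothetical. Each class is keyed by its alphabetically first member, which matches the convention used to name classes in the tables.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical(seq):
    """Alphabetically first member of the class of seq, taken over all
    rotations of seq and all rotations of its reverse complement."""
    rc = reverse_complement(seq)
    return min(s[i:] + s[:i] for s in (seq, rc) for i in range(len(seq)))


def equivalence_classes(p):
    """Map each class representative to the sorted list of its members."""
    classes = {}
    for letters in product("ACGT", repeat=p):  # generated in alphabetical order
        seq = "".join(letters)
        classes.setdefault(canonical(seq), []).append(seq)
    return classes


def size_histogram(p):
    """Number of classes of each size for period p."""
    return Counter(len(members) for members in equivalence_classes(p).values())


if __name__ == "__main__":
    for p in range(1, 7):
        print(p, len(equivalence_classes(p)), dict(size_histogram(p)))
```

Brute force costs on the order of $4^P \cdot P$ string operations, which is practical only for small P. For prime P the class count can be compared with the closed form $(4^P - 4)/(2P) + 2$.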
In Tables 2, 3 and 4 we list alphabetically all the members of each equivalence class for sequences of length 2 to 4. When P = 4, for instance, one finds 39 classes: 27 classes with 8 elements, 8 classes with 4 elements, and 4 classes with 2 elements (27 × 8 + 8 × 4 + 4 × 2 = 256 = 4^4 sequences in total). Only 33 classes are new, in the sense that 6 classes are derived from patterns already encountered at P = 1 and 2. Likewise, when P > 2 is a prime number, the total number of classes is given by:

$$\frac{4^P - 4}{2P} + 2 \tag{36}$$

with two classes of size 2 associated with poly-A and poly-C, while all the remaining classes are new and contain 2P members. In the appendix, we provide tables in alphabetical order that make it possible to invert Tables 3 and 4, i.e. to find the class associated with any given P-tuple (P = 3, 4).

Table 1. Number of repeat unit equivalence classes. New or proper classes are classes that do not contain a shorter periodic pattern. Columns: sequence length, classes (total), classes (new).

Table 2. Dinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally. Classes 1 and 5 are not proper dinucleotide classes.

Class number   List of members (alphabetical order)
1              AA TT
2              AC CA GT TG
3              AG CT GA TC
4              AT TA
5              CC GG
6              CG GC

Table 3. Trinucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally.

Class number   List of members (alphabetical order)
1              AAA TTT
2              AAC ACA CAA GTT TGT TTG
3              AAG AGA CTT GAA TCT TTC
4              AAT ATA ATT TAA TAT TTA
5              ACC CAC CCA GGT GTG TGG
6              ACG CGA CGT GAC GTC TCG
7              ACT AGT CTA GTA TAC TAG
8              AGC CAG CTG GCA GCT TGC
9              AGG CCT CTC GAG GGA TCC
10             ATC ATG CAT GAT TCA TGA
11             CCC GGG
12             CCG CGC CGG GCC GCG GGC

Fig. 1. Dinucleotide prefix automaton for the propeller twist angle scale. The CAG repeat, for instance, is associated with the cycle C → A → G → C in the graph and has a total propeller twist value of -34.53. The corresponding reverse complement cycle is given by C → T → G → C. The triplet repeat class with the largest propeller twist value is CCC followed by CCG.

Analysis of DNA repeats by dinucleotide scales

In the case of dinucleotide scales, the prefix automaton contains four nodes (Figure 1).
Each DNA sequence is associated with a path through the corresponding graph, and exact repeats are associated with cycles. All paths, including cycles, of length greater than four are composite in the sense that they contain a cycle of length 4 or less. In Table 5, we list the dinucleotide scale values $S(X_1X_2) + S(X_2X_1)$ for the six equivalence classes associated with all 16 possible dinucleotide repeats of the form $(X_1X_2)^R$. For each scale, we list classes (represented by their first alphabetical member) and the corresponding scale value, in decreasing value order. The highest level of base stacking energy is achieved by the AT repeat class (-10.39) and the lowest by the CG repeat class (-24.28). The rankings of all possible dinucleotide repeats induced by the propeller twist and the protein deformability scales are identical, with the exception of an inversion between the CC (-16.22 and 12.2) and CG (-21.11 and 16.1) classes at the high (flexible) end of the spectrum. At the opposite (stiff) end, we find the single letter repeat class AA (-37.32 and 5.8) followed by the proper dinucleotide repeat class AG (-27.48 and 6.6).

Table 4. Tetranucleotide classes equivalent under circular permutation and reverse complement operations. Classes are numbered in alphabetical order vertically. Class members are listed in alphabetical order horizontally.

Class number   List of members (alphabetical order)
1              AAAA TTTT
2              AAAC AACA ACAA CAAA GTTT TGTT TTGT TTTG
3              AAAG AAGA AGAA CTTT GAAA TCTT TTCT TTTC
4              AAAT AATA ATAA ATTT TAAA TATT TTAT TTTA
5              AACC ACCA CAAC CCAA GGTT GTTG TGGT TTGG
6              AACG ACGA CGAA CGTT GAAC GTTC TCGT TTCG
7              AACT ACTA AGTT CTAA GTTA TAAC TAGT TTAG
8              AAGC AGCA CAAG CTTG GCAA GCTT TGCT TTGC
9              AAGG AGGA CCTT CTTC GAAG GGAA TCCT TTCC
10             AAGT ACTT AGTA CTTA GTAA TAAG TACT TTAC
11             AATC ATCA ATTG CAAT GATT TCAA TGAT TTGA
12             AATG ATGA ATTC CATT GAAT TCAT TGAA TTCA
13             AATT ATTA TAAT TTAA
14             ACAC CACA GTGT TGTG
15             ACAG AGAC CAGA CTGT GACA GTCT TCTG TGTC
16             ACAT ATAC ATGT CATA GTAT TACA TATG TGTA
17             ACCC CACC CCAC CCCA GGGT GGTG GTGG TGGG
18             ACCG CCGA CGAC CGGT GACC GGTC GTCG TCGG
19             ACCT AGGT CCTA CTAC GGTA GTAG TACC TAGG
20             ACGC CACG CGCA CGTG GCAC GCGT GTGC TGCG
21             ACGG CCGT CGGA CGTC GACG GGAC GTCC TCCG
22             ACGT CGTA GTAC TACG
23             ACTC AGTG CACT CTCA GAGT GTGA TCAC TGAG
24             ACTG AGTC CAGT CTGA GACT GTCA TCAG TGAC
25             AGAG CTCT GAGA TCTC
26             AGAT ATAG ATCT CTAT GATA TAGA TATC TCTA
27             AGCC CAGC CCAG CTGG GCCA GCTG GGCT TGGC
28             AGCG CGAG CGCT CTCG GAGC GCGA GCTC TCGC
29             AGCT CTAG GCTA TAGC
30             AGGC CAGG CCTG CTGC GCAG GCCT GGCA TGCC
31             AGGG CCCT CCTC CTCC GAGG GGAG GGGA TCCC
32             ATAT TATA
33             ATCC ATGG CATC CCAT GATG GGAT TCCA TGGA
34             ATCG CGAT GATC TCGA
35             ATGC CATG GCAT TGCA
36             CCCC GGGG
37             CCCG CCGC CGCC CGGG GCCC GCGG GGCG GGGC
38             CCGG CGGC GCCG GGCC
39             CGCG GCGC

In Table 6, we list the dinucleotide scale values for the 12 equivalence classes associated with all possible triplet repeats of the form $(XYZ)^R$.
In this special case, we find the results of Baldi et al. (1999). The high and low ends of the base stacking energy scale are occupied by the triplet classes AAT (-15.76) and CCG (-32.54) respectively. We find again a high degree of correlation between the propeller twist and protein deformability scales. If we exclude the classes AAA/TTT (-55.98) and CCC/GGG (-24.33), which are not proper triplet repeat classes, then the maximum and the minimum of the propeller twist spectrum are respectively occupied by the classes CCG (-29.22) and AAG (-46.14). A similar ranking with the same extremal triplets is observed with the protein deformability scale: CCG (22.2) occupies the high end, whereas AAA (8.7) and AAG (9.5) occupy the low end of the spectrum. When considering all three dinucleotide scales, three minima and two maxima are occupied by two of the three repeat classes known to be involved in triplet repeat expansion diseases, namely AAG and CCG. GAA triplet (in the AAG class) expansion is associated with Friedreich's ataxia (Orr et al., 1993; Campuzano et al., 1996; Junck and Fink, 1996; Paulson et al., 1997; Koenig, 1998; Lee, 1998; Orr and Zoghbi, 1998; Paulson, 1998; Pulst, 1998; Stevanin et al., 1998). Abnormal GCC triplet (in the CCG class) expansion is associated with FRAXE mental retardation and abnormal expansion of the CGG triplet with fragile X syndrome (FRAXA) (Nelson, 1995; Gusella and MacDonald, 1996; Eichler and Nelson, 1998; Skinner et al., 1998; Gecz and Mulley, 1999). The third triplet expansion disease related class, AGC, has average rank in all dinucleotide scales.

Table 5. Dinucleotide structural scale values for repeat unit $p = X_1X_2$ with P = 2. $S(p) = S(X_1X_2) + S(X_2X_1)$.

Base stacking    Propeller twist    Protein deformability
AT (-10.39)      CC (-16.22)        CG (16.1)
AA (-10.74)      CG (-21.11)        CC (12.2)
CC (-16.52)      AC (-22.55)        AC (12.1)
AG (-16.59)      AT (-26.86)        AT (7.9)
AC (-17.08)      AG (-27.48)        AG (6.6)
CG (-24.28)      AA (-37.32)        AA (5.8)

Table 6. Dinucleotide structural scale values for repeat unit $p = X_1X_2X_3$ with P = 3. $S(p) = S(X_1X_2) + S(X_2X_3) + S(X_3X_1)$. Repeat classes associated with triplet repeat expansion diseases are in bold in the original table.

Base stacking    Propeller twist    Protein deformability
AAT (-15.76)     CCC (-24.33)       CCG (22.2)
AAA (-16.11)     CCG (-29.22)       ACG (18.9)
ACT (-21.11)     ACC (-30.66)       CCC (18.3)
AAG (-21.96)     AGC (-34.53)       ACC (18.2)
AAC (-22.45)     AGG (-35.59)       AGC (15.9)
ATC (-22.95)     ACG (-36.61)       ATC (15.9)
CCC (-24.78)     ATC (-37.94)       AAC (15.0)
AGG (-24.85)     ACT (-38.95)       AGG (12.7)
ACC (-25.34)     AAC (-41.21)       AAT (10.8)
AGC (-27.94)     AAT (-45.52)       ACT (10.7)
ACG (-30.01)     AAG (-46.14)       AAG (9.5)
CCG (-32.54)     AAA (-55.98)       AAA (8.7)

In Table 7, we list the scale values for the 39 equivalence classes associated with all possible tetranucleotide repeats of the form $(X_1X_2X_3X_4)^R$. The maximum of the base stacking scale is occupied by the dinucleotide repeat ATAT (-20.78) and the proper tetranucleotide repeat AAAT (-21.13). The minimum corresponds to CGCG (-48.56) followed by ACGC (-41.36).
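The repeat-unit values in Tables 5 and 6 are cyclic sums of dinucleotide values, which can be sketched as follows. The per-dinucleotide values below are not given in this section. They are an assumption, taken from the commonly tabulated propeller twist and protein deformability scales and chosen so that the cyclic sums reproduce the totals in Tables 5 and 6. The function names are hypothetical. The Pearson coefficient computed at the end is a plain correlation across the 12 triplet class representatives. It is only an illustration of the ranking agreement noted in the text, not the sequence-level correlation of equation (31).

```python
from math import sqrt

# ASSUMPTION: per-dinucleotide scale values, not listed in this section.
PROPELLER_TWIST = {
    "AA": -18.66, "AC": -13.10, "AG": -14.00, "AT": -15.01,
    "CA": -9.45, "CC": -8.11, "CG": -10.03, "CT": -14.00,
    "GA": -13.48, "GC": -11.08, "GG": -8.11, "GT": -13.10,
    "TA": -11.85, "TC": -13.48, "TG": -9.45, "TT": -18.66,
}
PROTEIN_DEFORMABILITY = {
    "AA": 2.9, "AC": 2.3, "AG": 2.1, "AT": 1.6,
    "CA": 9.8, "CC": 6.1, "CG": 12.1, "CT": 2.1,
    "GA": 4.5, "GC": 4.0, "GG": 6.1, "GT": 2.3,
    "TA": 6.3, "TC": 4.5, "TG": 9.8, "TT": 2.9,
}

# First alphabetical member of each of the 12 triplet classes (Table 3).
TRIPLET_CLASSES = ["AAA", "AAC", "AAG", "AAT", "ACC", "ACG",
                   "ACT", "AGC", "AGG", "ATC", "CCC", "CCG"]


def repeat_unit_score(unit, scale):
    """S(p): sum of the scale over the P dinucleotides read around the circle."""
    p = len(unit)
    return sum(scale[unit[i] + unit[(i + 1) % p]] for i in range(p))


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


if __name__ == "__main__":
    twist = [repeat_unit_score(u, PROPELLER_TWIST) for u in TRIPLET_CLASSES]
    deform = [repeat_unit_score(u, PROTEIN_DEFORMABILITY) for u in TRIPLET_CLASSES]
    for unit, t, d in zip(TRIPLET_CLASSES, twist, deform):
        print(f"{unit}  twist={t:7.2f}  deformability={d:5.1f}")
    print("Pearson r across classes:", round(pearson(twist, deform), 3))
```

Because both scales are reverse-complement invariant, any member of a class gives the same cyclic sum as its representative.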
We again observe a substantial positive correlation between the values produced by the propeller twist and protein deformability scales, together with a weaker negative correlation with respect to the base stacking energy scale. The high end of the propeller twist scale is occupied by CCCC (-32.44) and CCCG (-37.33), while that of the protein deformability scale is occupied by CGCG (32.2) and CCCG (28.3). The lowest values correspond for both scales to AAAA (-74.64 and 11.6) and AAAG (-64.80 and 12.4).

All repeat units of length greater than 4 are made up of shorter cyclic paths in the prefix automaton and therefore their properties can essentially be predicted from the previous three tables. For all lengths, for instance, the highest level of base stacking energy is achieved by the class ATATATAT... when P is even, and by the class AATATATAT... when P is odd. The lowest level is achieved by the class CGCGCG... when P is even, and CCGCGCG... when P is odd. For protein deformability, the maximal level is achieved by the class CGCGCG... when P is even, and by CCGCGCG... when P is odd. The lowest level is associated with poly-A (i.e. $(A)^P$). Poly-C and poly-A also give the absolute highest and lowest propeller twist angles at all lengths.

Table 7. Dinucleotide structural scale values for repeat unit $p = X_1X_2X_3X_4$ with P = 4. $S(p) = S(X_1X_2) + S(X_2X_3) + S(X_3X_4) + S(X_4X_1)$.

Base stacking     Propeller twist    Protein deformability
ATAT (-20.78)     CCCC (-32.44)      CGCG (32.2)
AAAT (-21.13)     CCCG (-37.33)      CCCG (28.3)
AATT (-21.13)     CCGG (-37.33)      CCGG (28.3)
AAAA (-21.48)     ACCC (-38.77)      ACGC (28.2)
AACT (-26.48)     CGCG (-42.22)      ATGC (25.2)
AAGT (-26.48)     AGCC (-42.64)      ACCG (25.0)
AGAT (-26.98)     AGGC (-42.64)      ACGG (25.0)
AAAG (-27.33)     ACGC (-43.66)      CCCC (24.4)
ACAT (-27.47)     AGGG (-43.70)      ACCC (24.3)
AAAC (-27.82)     ACCG (-44.72)      ACAC (24.2)
AATC (-28.32)     ACGG (-44.72)      ACGT (23.0)
AATG (-28.32)     ATGC (-44.99)      AGCG (22.7)
ACCT (-29.37)     ACAC (-45.10)      ATCG (22.7)
AAGG (-30.22)     ATCC (-46.05)      AGCC (22.0)
AACC (-30.71)     ACCT (-47.06)      AGGC (22.0)
ATCC (-31.21)     ACGT (-48.08)      ATCC (22.0)
AGCT (-31.97)     AGCG (-48.59)      AACG (21.8)
CCCC (-33.04)     AACC (-49.32)      AACC (21.1)
AGGG (-33.11)     ACAT (-49.41)      ACAT (20.0)
AGAG (-33.18)     ACAG (-50.03)      AAGC (18.8)
AAGC (-33.31)     ACTC (-50.03)      AATC (18.8)
ACCC (-33.60)     ACTG (-50.03)      AATG (18.8)
ACAG (-33.67)     AGCT (-50.93)      AGGG (18.8)
ACTC (-33.67)     ATCG (-52.00)      ACAG (18.7)
ACTG (-33.67)     AAGC (-53.19)      ACTC (18.7)
ACAC (-34.16)     ATAT (-53.72)      ACTG (18.7)
ATGC (-34.30)     AAGG (-54.25)      AAAC (17.9)
ACGT (-34.53)     AGAT (-54.34)      ACCT (16.8)
AACG (-35.38)     AGAG (-54.96)      ATAT (15.8)
ATCG (-35.88)     AACG (-55.27)      AAGG (15.6)
AGCC (-36.20)     AATC (-56.60)      AGAT (14.5)
AGGC (-36.20)     AATG (-56.60)      AGCT (14.5)
ACCG (-38.27)     AACT (-57.61)      AAAT (13.7)
ACGG (-38.27)     AAGT (-57.61)      AATT (13.7)
CCCG (-40.80)     AAAC (-59.87)      AACT (13.6)
CCGG (-40.80)     AAAT (-64.18)      AAGT (13.6)
AGCG (-40.87)     AATT (-64.18)      AGAG (13.2)
ACGC (-41.36)     AAAG (-64.80)      AAAG (12.4)
CGCG (-48.56)     AAAA (-74.64)      AAAA (11.6)

Analysis of DNA repeats by trinucleotide scales

Fig. 2. Trinucleotide prefix automaton for the bendability scale. Circle is used for ease of display but does not represent actual connections. The CAG repeat, for instance, is associated with the cycle CA → AG → GC → CA in the graph and has a total bendability value of 0.175 + 0.017 + 0.076 = 0.268. It is the highest bendability value for any triplet repeat. Other edges are not shown.

Table 8. Trinucleotide structural scale values for repeat unit $p = X_1X_2$ with P = 2. $S(p) = S(X_1X_2X_1) + S(X_2X_1X_2)$.

Bendability      Position preference
AT (0.364)       AA (72)
AG (0.058)       CG (50)
AC (0.034)       AT (26)
CC (-0.024)      CC (26)
CG (-0.154)      AC (23)
AA (-0.548)      AG (17)

In the case of trinucleotide scales, the prefix automaton contains 16 nodes (Figure 2), each one labeled with a different dinucleotide. All paths, including cycles, of length greater than 16 are composite, i.e. contain at least one cycle of length 16 or less. The trinucleotide scale values for all repeats with periodic unit length P = 2 are given in Table 8. The highest level of bendability is achieved by AT (0.364) and the lowest by AA (-0.548) and CG (-0.154). The highest level of position preference is achieved by AA (72) and CG (50), and the lowest by AG (17). The trinucleotide scale values for all repeats with periodic unit length P = 3 are given in Table 9 (see also Baldi et al., 1999). The highest level of bendability is achieved by the class AGC (0.268) and the lowest by AAA (-0.822) and ACC (-0.238).
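For trinucleotide scales the same cyclic sum runs over windows of length 3, wrapping around the repeat unit even when P < 3 (as in the Table 8 formula). The sketch below is a minimal illustration with a deliberately partial bendability table. It contains only the three edge values shown in Fig. 2 plus two values inferred from the Table 8 totals (AAA as -0.548/2 and ATA as 0.364/2, which is an assumption). The table name and function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

# PARTIAL table, for illustration only. Keys are the alphabetically smaller of
# a trinucleotide and its reverse complement. CAG, AGC and GCA are the edge
# values of Fig. 2. AAA and ATA are inferred from the Table 8 totals (assumption).
BENDABILITY = {
    "CAG": 0.175,
    "AGC": 0.017,
    "GCA": 0.076,
    "AAA": -0.274,
    "ATA": 0.182,
}


def canonical(word):
    """Key shared by a word and its reverse complement."""
    rc = word.translate(COMPLEMENT)[::-1]
    return min(word, rc)


def repeat_unit_score(unit, scale, width=3):
    """Sum the scale over the P windows of the given width read around the
    circle; indices wrap modulo P, so P may be smaller than the width."""
    p = len(unit)
    total = 0.0
    for i in range(p):
        word = "".join(unit[(i + j) % p] for j in range(width))
        total += scale[canonical(word)]  # KeyError if the triplet is not tabulated
    return total


if __name__ == "__main__":
    for unit in ("CAG", "CTG", "AT", "AA", "AAA"):
        print(unit, round(repeat_unit_score(unit, BENDABILITY), 3))
```

Looking values up through `canonical` enforces reverse-complement invariance, so CAG and CTG receive the same score. A full table would need one entry for each of the 32 reverse-complement pairs.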
In fact only two classes of repeats (AGC and ATC) have positive bendability and are well separated from the rest. The highest level of position preference is achieved by the class AAA (108) followed by CCG (72), and the lowest by AGG and ACC (2). The class AGC, which contains the CAG repeat responsible for the majority of the known triplet repeat expansion diseases, has the highest bendability. It is the only repeat class for which all three shifted triplets have a high individual bendability. Moreover, this class has a relatively low position preference value, another sign of flexibility. Therefore one can hypothesize that long CAG repeats correspond to stretches of DNA that are highly flexible in all positions. Consistent with their high flexibility, CAG/CTG repeats have been found to have the highest affinity for histones among all possible triplet repeats (Wang and Griffith, 1994, 1995; Godde and Wolffe, 1996). Other DNA sequences can adopt long range curvature only if they contain highly flexible triplets in phase with the helical pitch (roughly every 10.5 bp). The flexibility of extended CAG repeats has been verified experimentally (Chastain and Sinden, 1998). The CCG class, which contains the disease-related triplets CGG and GCC, is found at the high (rigid) end of the position preference scale (72), exceeded only by poly-A. This class is also stiff according to the bendability scale (-0.106). This is consistent with the fact that CGG/CCG repeats seem completely unable to form nucleosomes (Wang et al., 1996; Godde et al., 1996). The AAG class, which contains the disease-related triplet GAA, occupies the lower (flexible) end of the position preference scale (27). It is the second lowest, considering that the last two classes have the same value (2). We also note that AAA/TTT is by far the stiffest of all possible repeats according to both scales.
Such homopolymeric tracts are known from X-ray crystallography to be rigid and straight (Nelson et al., 1987) and they are poor candidates for nucleosome positioning. In fact, a number of promoters in yeast contain homopolymeric dA:dT elements. Studies in two different yeast species have shown that the homopolymeric elements destabilize nucleosomes and thereby facilitate
