Complexity of Biomolecular Sequences
1 Complexity of Biomolecular Sequences
Institute of Signal Processing, Tampere University of Technology
2 Outline
➀ Introduction
➁ Biological Preliminaries
➂ Compression Preliminaries
➃ The Biocompress Program
➄ NML Model for Discrete Regression
➅ The GeNML Algorithm
➆ Results
3 Introduction
Complete DNA sequences of many organisms are known, and their number is rapidly increasing. The sequences are huge:
- E. coli: 4,639,221 bp
- H. sapiens: 3 billion bp
With a 4-letter alphabet, the trivial encoding costs 2 bits/bp. Computational aid is necessary for processing these sequences.
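The 2 bits/bp baseline is easy to see in code. The following is my own illustration (not from the slides), packing four bases into each byte:

```python
# Minimal sketch of the trivial 2-bit encoding for the 4-letter DNA alphabet.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into bytes, 4 bases (2 bits each) per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a partial final group
        out.append(byte)
    return bytes(out)

print(len(pack("ACGT" * 1000)))  # 1000 bytes for 4000 bases = 2 bits/base
```

At this rate the human genome costs roughly 732 Mbytes, which is the figure the final slide quotes.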
4 Introduction 2
The objective of computational investigations is twofold:
- Compress the DNA sequences
  - Cuts down storage space and transmission costs
  - Algorithm complexity is critical
- Model the statistical properties of the data
  - Find patterns and structure within them
  - Closely related to compression
  - Algorithm complexity is not so important
DNA compression is of paramount importance in studying biomolecular sequences.
5 Introduction 3
Biological studies reveal important statistical information:
- DNA sequences contain many approximate tandem repeats
- Many essential genes have multiple copies
- There are only about 1,000 basic protein folding patterns
- Genes duplicate themselves for evolutionary purposes
These facts suggest that DNA sequences are well compressible.
6 Introduction 4
Unfortunately, other known facts hamper efficient compression:
- Regularities are often blurred by random mutation, translocation, crossover, reversal events, and sequencing errors
- Only about 10% of a sequence contains genes; the rest is considered to be non-coding
Conclusion: compression of DNA is difficult!
7 Biological Preliminaries
Important facts about DNA sequences from biological studies:
- Sequences have a 3D structure, which can be reversibly unfolded into a string of symbols
- There are four kinds of nucleotides: A, C, G, T (U in RNA)
- There are links between complementary bases: A-T, C-G
- Some pairs of complementary subsequences are mapped together; such pairs of subsequences are called palindromes
8 Biological Preliminaries 2
Figure 1: Secondary structure of an RNA sequence
9 Compression Preliminaries
Further observations about DNA sequences:
➀ Repetitions are
  a. very sparse
  b. relatively long
  c. roughly half the time palindromes
➁ Long sequences often can be matched approximately, allowing for
  a. deletions
  b. substitutions
  c. insertions
➂ Repetitions can occur far from each other
➃ Contextual correlation is not too significant
10 Compression Preliminaries 2
Problems with existing technologies:
- General-purpose coders are not good because of ➀c, ➁, ➂, ➃
- PPM and its derivatives are not good because of ➀a, ➀b, ➁
- BWT is not good because of ➀a, ➀c, ➁
- Substitution-based methods have difficulties with ➁, ➃
Observation: among these algorithm classes, substitution-based methods offer the best performance on DNA sequences, which is why substitutional methods are popular in DNA encoders.
11 Compression Preliminaries 3
The pair of factors f and f_α^{-1} is called a palindrome, where
- f denotes a sequence a_1, a_2, ..., a_n
- f^{-1} is the sequence in reverse order: a_n, a_{n-1}, ..., a_1
- f_α complements each character in the sequence: A ↔ T, C ↔ G
E.g. if f = AAACGT, then f_α^{-1} = ACGTTT.
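The palindrome operation f → f_α^{-1} is just the familiar reverse complement. A small helper (my own sketch, not from the slides) makes the definition concrete:

```python
# f_alpha^{-1}: reverse the factor, then complement each base (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    """Return the palindromic partner f_alpha^{-1} of factor f."""
    return f[::-1].translate(COMPLEMENT)

print(reverse_complement("AAACGT"))  # -> ACGTTT, matching the slide's example
```

Note that the operation is an involution: applying it twice returns the original factor, which is why palindromic repeats can be referenced in either direction.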
12 Compression Preliminaries 4
We can measure the complexity of DNA sequences by DNA compression: provide a compact representation of a DNA sequence from which an exact replica of the original can be restored.
- Practical considerations are important: running time and complexity (memory, code)
- We are not concerned with lossy compression
13 Compression Preliminaries 5
Another approach to estimating the complexity is entropy estimation: provide a reliable entropy estimate that asymptotically converges to the actual entropy.
- Practical requirements are less significant
- Usability is lower as well
- Entropy estimates are difficult to justify
14 The Biocompress Program
S. Grumbach, F. Tahi, "Compression of DNA sequences," Data Compression Conference 1993 (DCC '93).
- Encodes a text on the four-letter alphabet and produces a binary sequence
- Follows the LZSS scenario; the window size equals that of the input
- Substitutes factors with earlier occurrences, which are either identical or palindromes
- References are shorter than the factors they refer to
15 The Biocompress Program 2
Representing numbers:
- Sequence lengths: reversed Fibonacci code followed by a 1
- Match lengths:
  - Matches shorter than 7 are discarded
  - Matches between 7 and 38 are written in 5 bits
  - Matches beyond 38: a 5-bit field followed by the Fibonacci code of the remainder
- Match positions: the shorter of the binary or Fibonacci form
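The Fibonacci representation the slide alludes to is the standard Fibonacci (Zeckendorf) universal code: the Zeckendorf bits, least significant position first, with a terminating 1 so every codeword ends in "11". This is my own sketch of the textbook code; Biocompress's exact bit conventions (reversal, offsets) may differ:

```python
def fib_encode(n):
    """Fibonacci code of n >= 1: Zeckendorf digits plus a final terminating 1."""
    assert n >= 1
    fibs = [1, 2]                       # F(2), F(3), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    code = [0] * (len(fibs) - 1)
    for i in range(len(fibs) - 2, -1, -1):
        if fibs[i] <= n:                # greedy Zeckendorf decomposition
            code[i] = 1
            n -= fibs[i]
    return "".join(map(str, code)) + "1"

print(fib_encode(1), fib_encode(7), fib_encode(38))  # 11 01011 101000011
```

Because Zeckendorf representations never contain two consecutive ones, the trailing "11" unambiguously terminates each codeword, making the code instantaneously decodable, which is what makes it attractive for unbounded match lengths.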
16 The Biocompress Program 3
S. Grumbach, F. Tahi, "A new challenge for compression algorithms: Genetic sequences," J. Inform. Process. Manage., vol. 30, no. 6, 1994.
An improved version of Biocompress is Biocompress-2:
- Literal and LZSS coding remain the same as in Biocompress
- An order-2 context coder with arithmetic coding has been added
- The best of the three methods codes a small prefix of the input
17 Discrete Regression
- Encode a vector using another known vector from a sample space
- Block-based decomposition; the regressor is chosen from past, finite data
For genomic data and a given regressor:
- The bit mask of matching symbols is deemed to be the output of a memoryless source
- Non-matching symbols take each of the other symbols with equal probability
This decomposes the problem into encoding bit patterns.
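The bit mask at the heart of the discrete-regression view can be illustrated in a few lines (my own sketch, not the paper's code): given a regressor block, the encoder transmits the agreement mask and then only needs to correct the mismatching symbols.

```python
def match_mask(y, x):
    """Agreement mask b for block y against regressor x: b_i = 1 iff y_i == x_i."""
    return [int(a == b) for a, b in zip(y, x)]

y = "ACGTACGT"   # block to encode
x = "ACGAACGA"   # regressor chosen from past data
print(match_mask(y, x))  # [1, 1, 1, 0, 1, 1, 1, 0]
```

Each 0 in the mask then costs log2(M-1) bits to correct (log2 3 for DNA), since the mismatching symbol is assumed uniform over the remaining three letters.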
18 Discrete Regression 2
19 The NML Model for Discrete Regression
I. Tabus, G. Korodi, J. Rissanen, "DNA sequence compression using the normalized maximum likelihood model for discrete regression," Data Compression Conference 2003 (DCC '03).
- Practical encoder with lightweight requirements
- Effectively combines NML, context, and clear coding with RLE plus arithmetic coding
- Instances of the NML and context coders are used with different parameters
- The best of the methods codes a small prefix of the input
20 The NML Model for Discrete Regression 2
Objective: encode a sequence y^n based on a known discrete regressor sequence x^n.
- Choose a parametric probability model P(y^n | x^n; Θ)
- Obtain the maximized likelihood P(y^n | x^n; Θ̂(y^n, x^n))
- Obtain the universal NML model P̂(y^n | x^n) by normalization of the maximized likelihood
- Using this model, y^n is encoded by an arithmetic coder
x^n is chosen so that the Hamming distance between x^n and y^n is minimized.
21 The NML Model for Discrete Regression 3
Benefits:
- Inherently suitable for ➀, ➁b, ➂
- Practical encoders are feasible with low complexity
Drawbacks:
- Cannot efficiently handle ➁a, ➁c; some remedy is provided by the block-based decomposition
- Not optimal with ➀b; solution: run-length coding is added
- Difficulties with ➃; solution: a low-order context coder is added
22 The NML Model for Discrete Regression 4
Denoting the block to be encoded by y^n and the regressor block by x^n,
$$P(y_i \mid x_i; \theta) = \begin{cases} \theta & \text{if } y_i = x_i \\ \dfrac{\psi}{M-1} & \text{if } y_i \neq x_i, \end{cases} \qquad \text{with } \psi = 1 - \theta.$$
We extend this model to blocks as
$$P(y^n \mid x^n; \theta) = \theta^{\sum_{i=0}^{n-1} \chi(y_i = x_i)} \left(\frac{\psi}{M-1}\right)^{\sum_{i=0}^{n-1} \chi(y_i \neq x_i)} = \theta^{n_m} \left(\frac{\psi}{M-1}\right)^{n - n_m}.$$
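A numeric sanity check (my own sketch) of the block model above: the likelihood depends on the regressor only through the match count n_m, and it peaks at the maximum-likelihood estimate θ̂ = n_m/n used on the next slide.

```python
def block_likelihood(n, n_m, theta, M=4):
    """P(y^n | x^n; theta) = theta^{n_m} * ((1 - theta)/(M - 1))^{n - n_m}."""
    psi = 1.0 - theta
    return theta ** n_m * (psi / (M - 1)) ** (n - n_m)

# The likelihood is maximized at theta-hat = n_m / n:
n, n_m = 48, 40
best = block_likelihood(n, n_m, n_m / n)
assert all(block_likelihood(n, n_m, t) <= best for t in (0.5, 0.7, 0.9, 0.99))
```

This is exactly why only n_m needs to be transmitted before the mask: the positions of the matches carry no information under this memoryless model.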
23 The NML Model for Discrete Regression 5
Since θ̂(y^n, x^n) = n_m/n, the maximized likelihood is
$$P(y^n \mid x^n; \hat\theta(y^n, x^n)) = \left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n - n_m}{n(M-1)}\right)^{n - n_m}.$$
For normalization, use only blocks similar enough to x^n:
$$\hat P(y^n \mid x^n) = \frac{\left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n - n_m}{n(M-1)}\right)^{n - n_m}}{\sum_{m \in \Lambda_n} \binom{n}{m} (M-1)^{n-m} \left(\frac{m}{n}\right)^{m} \left(\frac{n - m}{n(M-1)}\right)^{n - m}}.$$
24 The NML Model for Discrete Regression 6
The set Λ_n = {N(w,n), ..., n} is computed from
$$N(w,n) = \min\{\, n_m : L_{\mathrm{NML}}(n_m, N(w,n)) + \log_2 w + 1 < 2n \,\}.$$
Introducing
$$C_{n,N} = \sum_{m=N}^{n} \binom{n}{m} \left(\frac{m}{n}\right)^{m} \left(\frac{n - m}{n}\right)^{n - m},$$
the NML code length is
$$L_{\mathrm{NML}}(n_m, N) = \log_2 C_{n,N} - n_m \log_2 \frac{n_m}{n} - (n - n_m) \log_2 \frac{n - n_m}{n} + (n - n_m) \log_2 (M - 1).$$
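Assuming the normalizer C_{n,N} and code length L_NML as defined on this slide, a direct numeric evaluation (my own sketch) shows how steeply the cost falls as the match count grows, which is the behavior plotted in Figure 2:

```python
import math

def C_nN(n, N):
    """Constrained normalizer: sum over the allowed match counts m = N..n."""
    return sum(math.comb(n, m) * (m / n) ** m * ((n - m) / n) ** (n - m)
               for m in range(N, n + 1))

def L_NML(n_m, n, N, M=4):
    """NML code length in bits for a block of n symbols with n_m matches."""
    bits = math.log2(C_nN(n, N))
    if n_m > 0:
        bits -= n_m * math.log2(n_m / n)
    if n_m < n:
        bits -= (n - n_m) * math.log2((n - n_m) / n)
        bits += (n - n_m) * math.log2(M - 1)
    return bits

# With n = 48 and N = 30, a near-perfect match is far cheaper than the
# 2n = 96-bit clear representation:
print(round(L_NML(30, 48, 30), 1), round(L_NML(40, 48, 30), 1), round(L_NML(48, 48, 30), 1))
```

Constraining the normalization to m ≥ N shrinks C_{n,N}, which lowers the code length for the good matches the encoder actually uses.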
25 The NML Model for Discrete Regression 7
Figure 2: NML code length as a function of the number of matching bases n_m, when Y contains all possible blocks (unconstrained normalization) and when Y_{x^n} contains only the blocks with a number of correct matches larger than N = 30, for n = 48; the cost of the clear representation is shown for comparison.
26 The NML Model for Discrete Regression 8
Encoding a block y^n with the NML model goes as follows:
➀ Find the best regressor x^n
➁ Encode the position and direction (normal or palindrome) of x^n
➂ Encode the binary mask b^n, where b_i = χ(y_i = x_i)
➃ Correct the non-matching characters indicated by b^n
27 The NML Model for Discrete Regression 9
For finding the best regressor:
- Both normal and palindrome matches are considered
- The regressor must lie fully inside the window of w past symbols preceding the current block
- The mask b^n is required to contain a contiguous run of k matching symbols (a seed); this requirement is used to speed up the search
- Increasing k makes the search much faster, with little loss in compression
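The seed requirement can be exploited with a k-mer index over the window, so that only positions sharing an exact k-mer with the block are ever scored. This is my own sketch of such a filter, not the paper's data structures:

```python
from collections import defaultdict

def build_seed_index(window, k):
    """Map every k-mer of the search window to its starting offsets."""
    index = defaultdict(list)
    for i in range(len(window) - k + 1):
        index[window[i:i + k]].append(i)
    return index

def candidate_positions(block, window_len, index, k):
    """Window offsets whose regressor shares an exact k-mer (seed) with the block."""
    hits, last = set(), window_len - len(block)
    for j in range(len(block) - k + 1):
        for i in index.get(block[j:j + k], ()):
            pos = i - j                  # align the shared seed
            if 0 <= pos <= last:         # regressor must lie fully in the window
                hits.add(pos)
    return hits

window = "ACGTACGTTTGCACGT"
index = build_seed_index(window, 4)
print(sorted(candidate_positions("ACGTACGA", len(window), index, 4)))  # -> [0, 4]
```

The trade-off the slide describes follows directly: a longer seed shrinks the candidate set (faster search) but may miss good approximate regressors whose matches never run k symbols in a row.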
28 The NML Model for Discrete Regression 10
Figure 3: Acceleration of the search using seeds of length r, when the block size is n = 48; the exact formula A(n, r) is shown together with the lower bound D(n, r).
29 The NML Model for Discrete Regression 11
Figure 4: Relative reduction in compression ratio, (E[L_{I2}] - E[L_{I1}]) / E[L_{I1}], when using seeds of length r, against the case of exhaustive search, when the block size is n = 48. Circular marks show P(miss | n_m) obtained on a random sequence, but with probabilities P(n_m) collected from the DNA sequence HUMGHCSA. Triangular marks show the change in performance of GeNML on the same file.
30 The NML Model for Discrete Regression 12
Next, the position and direction of the best match are encoded:
- Normally this takes up log_2 min{(l-1)n + 1, w} + 1 bits
- Long approximate matches are efficiently coded with match prediction
- Long exact matches are coded with run lengths
31 The NML Model for Discrete Regression 13
The binary mask b^n can be encoded in two steps:
➀ The number of matching bases n_m is encoded according to
$$P(n_m) = \sum_{b^n : \sum_i b_i = n_m} \hat P(b^n) = \frac{\binom{n}{n_m} \left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n - n_m}{n}\right)^{n - n_m}}{C_{n,N}}.$$
➁ The binary mask b^n is encoded bit-wise with the distribution
$$P(b_k = 0) = \frac{n - k - n(k)}{n - k}, \qquad P(b_k = 1) = \frac{n(k)}{n - k}, \qquad \text{where } n(k) = \sum_{j=k}^{n-1} b_j.$$
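The bit-wise distribution in step ➁ is enumerative coding: once n_m is known, coding each bit with P(b_k = 1) = n(k)/(n - k), where n(k) counts the ones still to come, spends exactly log2 C(n, n_m) bits in total. A small check (my own sketch):

```python
import math

def mask_bits(b):
    """Total bits when coding mask b with P(b_k = 1) = n(k)/(n - k)."""
    n, total = len(b), 0.0
    for k, bit in enumerate(b):
        ones_left = sum(b[k:])           # n(k) = sum_{j >= k} b_j
        p1 = ones_left / (n - k)
        total += -math.log2(p1 if bit else 1.0 - p1)
    return total

b = [1, 1, 1, 0, 1, 1, 1, 0]             # n = 8, n_m = 6
assert abs(mask_bits(b) - math.log2(math.comb(8, 6))) < 1e-9
```

Intuitively, the decoder tracks how many ones remain among the unseen positions, so every mask with the same n_m gets the same, minimal description length.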
32 The NML Model for Discrete Regression 14
The overall code length for NML at position ln, with window size w, is
$$L_1(y^n) = \log_2 C_{n,N} - n_m \log_2 \frac{n_m}{n} - (n - n_m) \log_2 \frac{n - n_m}{n} + (n - n_m) \log_2 (M - 1) + \log_2 \min\{(l-1)n + 1, w\} + 1.$$
33 The GeNML Algorithm
G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Trans. on Information Systems, vol. 23, no. 1, pp. 3-34, 2005.
- Improved compression efficiency
- Practical resource requirements
- A complex model of several algorithms, each with several instances
34 The GeNML Algorithm 2
A context coder serves as an auxiliary coder that complements NML in performance:
- NML cannot capture redundancy concentrated in a small area
- For these blocks an order-1 context coder is used
The parameters for this coder are set as
$$\eta(a_k \mid a_j) = \frac{n(a_k a_j)}{\sum_a n(a\,a_j)}.$$
The overall code length for the context coder is
$$L_2(y^n) = -\sum_{i=1}^{n} \log_2 \eta(y_i \mid y_{i-1}).$$
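A sketch of the order-1 context cost, under my own assumptions: the bigram counts n(a_k a_j) are gathered from already-processed data, and an unseen context falls back to the uniform 1/4 (a real coder would smooth the counts instead):

```python
import math
from collections import Counter

def bigram_model(history):
    """Build eta(a_k | a_j) from bigram counts over already-seen data."""
    counts = Counter(zip(history, history[1:]))
    def eta(curr, prev):
        denom = sum(counts[(prev, a)] for a in "ACGT")
        return counts[(prev, curr)] / denom if denom else 0.25
    return eta

def context_code_length(block, prev_symbol, eta):
    """L_2(y^n) = -sum_i log2 eta(y_i | y_{i-1})."""
    bits, prev = 0.0, prev_symbol
    for c in block:
        p = eta(c, prev)
        bits += -math.log2(p) if p > 0 else float("inf")
        prev = c
    return bits

eta = bigram_model("ACAC" * 50)                  # a highly regular history
print(context_code_length("ACACAC", "C", eta))   # -> 0.0: fully predicted
```

On such locally repetitive stretches the context cost collapses toward zero, which is exactly the small-area redundancy NML misses.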
35 The GeNML Algorithm 3
DNA data often appear statistically random. Since such parts cannot be compressed, clear (uncompressed) encoding is used for them. The code length of the clear representation is
$$L_3(y^n) = 2n.$$
36 The GeNML Algorithm 4
The GeNML algorithm is outlined as follows:
- The DNA sequence is split into macroblocks
- One instance each of the NML, context, and clear coders forms a group
- Objects in different groups have different parameters
- The best group is selected to compress the next macroblock
- Inside the macroblock, compression is done block-wise
- The best algorithm of the group compresses the next block
37 The GeNML Algorithm 5
Step 1. Set parameters n_0, H_0, δ, C. Let m = δ^{C-1} n_0.
Step 2. For each macroblock M_k = s_{km}, ..., s_{(k+1)m-1}:
  Step 2.1. Let n = n_0, H = H_0.
  Step 2.2. Let L_n = 0.
  Step 2.3. For each block y^n in M_k:
    Step 2.3.1. Compute L_1, L_2, L_3.
    Step 2.3.2. Let L_n = L_n + min{L_1, L_2, L_3}.
  Step 2.4. If n < m, then let n = δn, H = H/δ, and go to Step 2.2; else proceed to the next step.
  Step 2.5. Find n_b for which L_{n_b} = min_n {L_n}; let H_b = nH/n_b. Signal n_b in the compressed stream.
  Step 2.6. For each block y^{n_b} in M_k:
    Step 2.6.1. Repeat Step 2.3.1.
    Step 2.6.2. Signal the best algorithm found in the previous step and encode y^{n_b}.
Figure 5: The specification of the GeNML algorithm.
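The core of the selection logic can be condensed into a few lines. This is my own sketch with simplified stand-in cost functions (the real L_1 and L_2 are the NML and context code lengths above); it shows how each block takes the cheapest coder and how the per-macroblock totals decide the block size n_b:

```python
def best_block_size(macroblock, coders, block_sizes):
    """coders: dict name -> cost function f(block) returning bits."""
    totals = {}
    for n in block_sizes:
        total = 0.0
        for start in range(0, len(macroblock), n):
            block = macroblock[start:start + n]
            total += min(cost(block) for cost in coders.values())  # Step 2.3
        totals[n] = total
    return min(totals, key=totals.get), totals                     # Step 2.5

coders = {
    "clear": lambda b: 2 * len(b),        # L_3 = 2n
    "toy":   lambda b: 8 + 0.5 * len(b),  # hypothetical stand-in for L_1/L_2
}
n_b, totals = best_block_size("A" * 96, coders, [16, 48])
print(n_b, totals)  # 48 {16: 96.0, 48: 64.0}
```

Larger blocks amortize each coder's fixed per-block overhead (here the toy 8-bit header), which is why the search over block sizes in Steps 2.2 to 2.4 pays off.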
38 The GeNML Algorithm 6
Figure 6: (a) The number of times the clear representation, the order-1 context coder, and NML with n = 48 prove to be the best. (b) The number of times the best match is used, the predicted match is used, and the best match is predicted.
39 Results
Table 1: Comparison of the compression (in bits per base) obtained by the algorithms Biocompress-2 (Bio2), GenCompress-2 (Gen2), CTW-LZ (CTW), DNACompress (DNA), and GeNML on the sequences CHMPXX, CHNTXX, HEHCMVCG, HUMDYSTROP, HUMGHCSA, HUMHDABCD, HUMHPRTB, MPOMTCG, MTPACG, and VACCG.
40 Results 2
Table 2: Comparison of the compression (in bits per base) obtained by the algorithms Cfact, GenCompress-2, and GeNML on the sequences Atatsgs, Atef1a, Atrdnaf, Atrdnai, Celk07e, HSG6PDGEN, Mmzp3g, and Xlxfg.
41 Human Genome Compression
The program GeNML is suitable for compressing the entire human genome:
- Original size: 3,070,521,116 bases (732 Mbytes at 2 bits/base)
- Number of specified bases: 2,832,183,...
- GeNML window size: 1 Mbyte
- GeNML seed length: 8
- Compressed size: 589,323,192 bytes
- Compression ratio for the specified bases: ... bpb
CSEP 52 Applied Algorithms Spring 25 Statistical Lossless Data Compression Outline for Tonight Basic Concepts in Data Compression Entropy Prefix codes Huffman Coding Arithmetic Coding Run Length Coding
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationMolecular Evolution & the Origin of Variation
Molecular Evolution & the Origin of Variation What Is Molecular Evolution? Molecular evolution differs from phenotypic evolution in that mutations and genetic drift are much more important determinants
More informationMolecular Evolution & the Origin of Variation
Molecular Evolution & the Origin of Variation What Is Molecular Evolution? Molecular evolution differs from phenotypic evolution in that mutations and genetic drift are much more important determinants
More informationHuffman Coding. C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University
Huffman Coding C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University http://www.csie.nctu.edu.tw/~cmliu/courses/compression/ Office: EC538 (03)573877 cmliu@cs.nctu.edu.tw
More informationLecture 3 : Algorithms for source coding. September 30, 2016
Lecture 3 : Algorithms for source coding September 30, 2016 Outline 1. Huffman code ; proof of optimality ; 2. Coding with intervals : Shannon-Fano-Elias code and Shannon code ; 3. Arithmetic coding. 1/39
More informationComputation Theory Finite Automata
Computation Theory Dept. of Computing ITT Dublin October 14, 2010 Computation Theory I 1 We would like a model that captures the general nature of computation Consider two simple problems: 2 Design a program
More informationAsymptotic redundancy and prolixity
Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends
More informationIntermittent Communication
Intermittent Communication Mostafa Khoshnevisan, Student Member, IEEE, and J. Nicholas Laneman, Senior Member, IEEE arxiv:32.42v2 [cs.it] 7 Mar 207 Abstract We formulate a model for intermittent communication
More informationMAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006)
MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK SATELLITE COMMUNICATION DEPT./SEM.:ECE/VIII UNIT V PART-A 1. What is binary symmetric channel (AUC DEC 2006) 2. Define information rate? (AUC DEC 2007)
More informationChapter 2 Source Models and Entropy. Any information-generating process can be viewed as. computer program in executed form: binary 0
Part II Information Theory Concepts Chapter 2 Source Models and Entropy Any information-generating process can be viewed as a source: { emitting a sequence of symbols { symbols from a nite alphabet text:
More informationReducing storage requirements for biological sequence comparison
Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,
More informationMutual information content of homologous DNA sequences
Mutual information content of homologous DNA sequences 55 Mutual information content of homologous DNA sequences Helena Cristina G. Leitão, Luciana S. Pessôa and Jorge Stolfi Instituto de Computação, Universidade
More informationBio nformatics. Lecture 3. Saad Mneimneh
Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per
More information"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky
MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally
More informationPiecewise Constant Prediction
Piecewise Constant Prediction Erik Ordentlich Information heory Research Hewlett-Packard Laboratories Palo Alto, CA 94304 Email: erik.ordentlich@hp.com Marcelo J. Weinberger Information heory Research
More informationMultimedia. Multimedia Data Compression (Lossless Compression Algorithms)
Course Code 005636 (Fall 2017) Multimedia Multimedia Data Compression (Lossless Compression Algorithms) Prof. S. M. Riazul Islam, Dept. of Computer Engineering, Sejong University, Korea E-mail: riaz@sejong.ac.kr
More informationEntropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea
Connectivity coding Entropy Coding dd 7, dd 6, dd 7, dd 5,... TG output... CRRRLSLECRRE Entropy coder output Connectivity data Edgebreaker output Digital Geometry Processing - Spring 8, Technion Digital
More informationECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec
University of Wisconsin Madison Electrical Computer Engineering ECE533 Digital Image Processing Embedded Zerotree Wavelet Image Codec Team members Hongyu Sun Yi Zhang December 12, 2003 Table of Contents
More informationSelective Use Of Multiple Entropy Models In Audio Coding
Selective Use Of Multiple Entropy Models In Audio Coding Sanjeev Mehrotra, Wei-ge Chen Microsoft Corporation One Microsoft Way, Redmond, WA 98052 {sanjeevm,wchen}@microsoft.com Abstract The use of multiple
More informationData Structures in Java
Data Structures in Java Lecture 20: Algorithm Design Techniques 12/2/2015 Daniel Bauer 1 Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways of
More informationChapter 2: Source coding
Chapter 2: meghdadi@ensil.unilim.fr University of Limoges Chapter 2: Entropy of Markov Source Chapter 2: Entropy of Markov Source Markov model for information sources Given the present, the future is independent
More informationEvolution of Genotype-Phenotype mapping in a von Neumann Self-reproduction within the Platform of Tierra
Evolution of Genotype-Phenotype mapping in a von Neumann Self-reproduction within the Platform of Tierra Declan Baugh and Barry Mc Mullin The Rince Institute, Dublin City University, Ireland declan.baugh2@mail.dcu.ie,
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationMAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK. SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A
MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK DEPARTMENT: ECE SEMESTER: IV SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A 1. What is binary symmetric channel (AUC DEC
More informationExample: sending one bit of information across noisy channel. Effects of the noise: flip the bit with probability p.
Lecture 20 Page 1 Lecture 20 Quantum error correction Classical error correction Modern computers: failure rate is below one error in 10 17 operations Data transmission and storage (file transfers, cell
More information+ (50% contribution by each member)
Image Coding using EZW and QM coder ECE 533 Project Report Ahuja, Alok + Singh, Aarti + + (50% contribution by each member) Abstract This project involves Matlab implementation of the Embedded Zerotree
More informationLinear-Space Alignment
Linear-Space Alignment Subsequences and Substrings Definition A string x is a substring of a string x, if x = ux v for some prefix string u and suffix string v (similarly, x = x i x j, for some 1 i j x
More informationCommunications Theory and Engineering
Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 AEP Asymptotic Equipartition Property AEP In information theory, the analog of
More informationChapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code
Chapter 2 Date Compression: Source Coding 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code 2.1 An Introduction to Source Coding Source coding can be seen as an efficient way
More informationObjective: Reduction of data redundancy. Coding redundancy Interpixel redundancy Psychovisual redundancy Fall LIST 2
Image Compression Objective: Reduction of data redundancy Coding redundancy Interpixel redundancy Psychovisual redundancy 20-Fall LIST 2 Method: Coding Redundancy Variable-Length Coding Interpixel Redundancy
More informationA Comparison of Methods for Redundancy Reduction in Recurrence Time Coding
1 1 A Comparison of Methods for Redundancy Reduction in Recurrence Time Coding Hidetoshi Yokoo, Member, IEEE Abstract Recurrence time of a symbol in a string is defined as the number of symbols that have
More informationThe Complete Set Of Genetic Instructions In An Organism's Chromosomes Is Called The
The Complete Set Of Genetic Instructions In An Organism's Chromosomes Is Called The What is a genome? A genome is an organism's complete set of genetic instructions. Single strands of DNA are coiled up
More informationDigital Communications III (ECE 154C) Introduction to Coding and Information Theory
Digital Communications III (ECE 154C) Introduction to Coding and Information Theory Tara Javidi These lecture notes were originally developed by late Prof. J. K. Wolf. UC San Diego Spring 2014 1 / 8 I
More information