Complexity of Biomolecular Sequences
1 Complexity of Biomolecular Sequences
Institute of Signal Processing, Tampere University of Technology
2 Outline
➀ Introduction
➁ Biological Preliminaries
➂ Compression Preliminaries
➃ The Biocompress Program
➄ NML Model for Discrete Regression
➅ The GeNML Algorithm
➆ Results
3 Introduction
Complete DNA sequences of many organisms are known, and their number is rapidly increasing. The sequences are huge:
- E. coli: 4,639,221 bp
- H. sapiens: 3 billion bp
With a 4-letter alphabet, the trivial encoding costs 2 bits/bp. Computational aid is necessary for processing these sequences.
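The 2 bits/bp baseline is easy to see in code. The following is my own illustration (not from the slides), packing four bases into each byte:

```python
# Minimal sketch of the trivial 2-bit encoding for the 4-letter DNA alphabet.
CODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into bytes, 4 bases (2 bits each) per byte."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        chunk = seq[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | CODE[base]
        byte <<= 2 * (4 - len(chunk))   # left-align a partial final group
        out.append(byte)
    return bytes(out)

print(len(pack("ACGT" * 1000)))  # 1000 bytes for 4000 bases = 2 bits/base
```

At this rate the human genome costs roughly 732 Mbytes, which is the figure the final slide quotes.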
4 Introduction 2
The objective of computational investigations is twofold:
- Compress the DNA sequences
  - Cuts down storage space and transmission costs
  - Algorithm complexity is critical
- Model the statistical properties of the data
  - Find patterns and structure within them
  - Closely related to compression
  - Algorithm complexity is not so important
DNA compression is of paramount importance in studying biomolecular sequences.
5 Introduction 3
Biological studies reveal important statistical information:
- DNA sequences contain many approximate tandem repeats
- Many essential genes have multiple copies
- There are only about 1,000 basic protein folding patterns
- Genes duplicate themselves for evolutionary purposes
These facts suggest that DNA sequences are well compressible.
6 Introduction 4
Unfortunately, other known facts hamper efficient compression:
- Regularities are often blurred by random mutation, translocation, crossover, reversal events, and sequencing errors
- Only about 10% of a sequence contains genes; the rest is considered to be non-coding
Conclusion: compression of DNA is difficult!
7 Biological Preliminaries
Important facts about DNA sequences from biological studies:
- Sequences have a 3D structure, which can be reversibly unfolded into a string of symbols
- There are four kinds of nucleotides: A, C, G, T (U in RNA)
- There are links between complementary bases: A-T, C-G
- Some pairs of complementary subsequences are mapped together; such pairs of subsequences are called palindromes
8 Biological Preliminaries 2
Figure 1: Secondary structure of an RNA sequence
9 Compression Preliminaries
Further observations about DNA sequences:
➀ Repetitions are
  a. very sparse
  b. relatively long
  c. roughly half the time palindromes
➁ Long sequences often can be matched approximately, allowing for
  a. deletions
  b. substitutions
  c. insertions
➂ Repetitions can occur far from each other
➃ Contextual correlation is not too significant
10 Compression Preliminaries 2
Problems with existing technologies:
- General-purpose coders are not good because of ➀c, ➁, ➂, ➃
- PPM and its derivatives are not good because of ➀a, ➀b, ➁
- BWT is not good because of ➀a, ➀c, ➁
- Substitution-based methods have difficulties with ➁, ➃
Observation: among these algorithm classes, substitution-based methods offer the best performance on DNA sequences, which is why substitutional methods are popular in DNA encoders.
11 Compression Preliminaries 3
The pair of factors f and f_α^{-1} is called a palindrome, where
- f denotes a sequence a_1, a_2, ..., a_n
- f^{-1} is the sequence in reverse order: a_n, a_{n-1}, ..., a_1
- f_α complements each character in the sequence: A ↔ T, C ↔ G
E.g. if f = AAACGT, then f_α^{-1} = ACGTTT.
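The palindrome operation f → f_α^{-1} is just the familiar reverse complement. A small helper (my own sketch, not from the slides) makes the definition concrete:

```python
# f_alpha^{-1}: reverse the factor, then complement each base (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(f):
    """Return the palindromic partner f_alpha^{-1} of factor f."""
    return f[::-1].translate(COMPLEMENT)

print(reverse_complement("AAACGT"))  # -> ACGTTT, matching the slide's example
```

Note that the operation is an involution: applying it twice returns the original factor, which is why palindromic repeats can be referenced in either direction.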
12 Compression Preliminaries 4
We can measure the complexity of DNA sequences by DNA compression: provide a compact representation of a DNA sequence from which an exact replica of the original can be restored.
- Practical considerations are important: running time and complexity (memory, code)
- We are not concerned with lossy compression
13 Compression Preliminaries 5
Another approach to estimating the complexity is entropy estimation: provide a reliable entropy estimate that asymptotically converges to the actual entropy.
- Practical requirements are less significant
- Usability is lower as well
- Entropy estimates are difficult to justify
14 The Biocompress Program
S. Grumbach, F. Tahi, "Compression of DNA sequences," Data Compression Conference 1993 (DCC '93).
- Encodes a text on the four-letter alphabet and produces a binary sequence
- Follows the LZSS scenario; the window size equals that of the input
- Substitutes factors with earlier occurrences, which are either identical or palindromes
- References are shorter than the factors they refer to
15 The Biocompress Program 2
Representing numbers:
- Sequence lengths: reversed Fibonacci code followed by a 1
- Match lengths:
  - Matches shorter than 7 are discarded
  - Matches between 7 and 38 are written in 5 bits
  - Matches beyond 38: a 5-bit field followed by the Fibonacci code of the remainder
- Match positions: the shorter of the binary or Fibonacci form
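The Fibonacci representation the slide alludes to is the standard Fibonacci (Zeckendorf) universal code: the Zeckendorf bits, least significant position first, with a terminating 1 so every codeword ends in "11". This is my own sketch of the textbook code; Biocompress's exact bit conventions (reversal, offsets) may differ:

```python
def fib_encode(n):
    """Fibonacci code of n >= 1: Zeckendorf digits plus a final terminating 1."""
    assert n >= 1
    fibs = [1, 2]                       # F(2), F(3), ...
    while fibs[-1] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    code = [0] * (len(fibs) - 1)
    for i in range(len(fibs) - 2, -1, -1):
        if fibs[i] <= n:                # greedy Zeckendorf decomposition
            code[i] = 1
            n -= fibs[i]
    return "".join(map(str, code)) + "1"

print(fib_encode(1), fib_encode(7), fib_encode(38))  # 11 01011 101000011
```

Because Zeckendorf representations never contain two consecutive ones, the trailing "11" unambiguously terminates each codeword, making the code instantaneously decodable, which is what makes it attractive for unbounded match lengths.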
16 The Biocompress Program 3
S. Grumbach, F. Tahi, "A new challenge for compression algorithms: Genetic sequences," J. Inform. Process. Manage., vol. 30, no. 6, 1994.
An improved version of Biocompress is Biocompress-2:
- Literal and LZSS coding remain the same as in Biocompress
- An order-2 context coder with arithmetic coding has been added
- The best of the three methods codes a small prefix of the input
17 Discrete Regression
- Encode a vector using another known vector from a sample space
- Block-based decomposition; the regressor is chosen from past, finite data
For genomic data and a given regressor:
- The bit mask of matching symbols is deemed to be the output of a memoryless source
- Non-matching symbols take each of the other symbols with equal probability
This decomposes the problem into encoding bit patterns.
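The bit mask at the heart of the discrete-regression view can be illustrated in a few lines (my own sketch, not the paper's code): given a regressor block, the encoder transmits the agreement mask and then only needs to correct the mismatching symbols.

```python
def match_mask(y, x):
    """Agreement mask b for block y against regressor x: b_i = 1 iff y_i == x_i."""
    return [int(a == b) for a, b in zip(y, x)]

y = "ACGTACGT"   # block to encode
x = "ACGAACGA"   # regressor chosen from past data
print(match_mask(y, x))  # [1, 1, 1, 0, 1, 1, 1, 0]
```

Each 0 in the mask then costs log2(M-1) bits to correct (log2 3 for DNA), since the mismatching symbol is assumed uniform over the remaining three letters.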
18 Discrete Regression 2
19 The NML Model for Discrete Regression
I. Tabus, G. Korodi, J. Rissanen, "DNA sequence compression using the normalized maximum likelihood model for discrete regression," Data Compression Conference 2003 (DCC '03).
- Practical encoder with lightweight requirements
- Effectively combines NML, context, and clear coding with RLE plus arithmetic coding
- Instances of the NML and context coders are used with different parameters
- The best of the methods codes a small prefix of the input
20 The NML Model for Discrete Regression 2
Objective: encode a sequence y^n based on a known discrete regressor sequence x^n.
- Choose a parametric probability model P(y^n | x^n; Θ)
- Obtain the maximized likelihood P(y^n | x^n; Θ̂(y^n, x^n))
- Obtain the universal NML model P̂(y^n | x^n) by normalization of the maximized likelihood
- Using this model, y^n is encoded by an arithmetic coder
x^n is chosen so that the Hamming distance between x^n and y^n is minimized.
21 The NML Model for Discrete Regression 3
Benefits:
- Inherently suitable for ➀, ➁b, ➂
- Practical encoders are feasible with low complexity
Drawbacks:
- Cannot efficiently handle ➁a, ➁c; some remedy is provided by the block-based decomposition
- Not optimal with ➀b; solution: run-length coding is added
- Difficulties with ➃; solution: a low-order context coder is added
22 The NML Model for Discrete Regression 4
Denoting the block to be encoded by y^n and the regressor block by x^n,
$$P(y_i \mid x_i; \theta) = \begin{cases} \theta & \text{if } y_i = x_i \\ \dfrac{\psi}{M-1} & \text{if } y_i \neq x_i, \end{cases} \qquad \text{with } \psi = 1 - \theta.$$
We extend this model to blocks as
$$P(y^n \mid x^n; \theta) = \theta^{\sum_{i=0}^{n-1} \chi(y_i = x_i)} \left(\frac{\psi}{M-1}\right)^{\sum_{i=0}^{n-1} \chi(y_i \neq x_i)} = \theta^{n_m} \left(\frac{\psi}{M-1}\right)^{n - n_m}.$$
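A numeric sanity check (my own sketch) of the block model above: the likelihood depends on the regressor only through the match count n_m, and it peaks at the maximum-likelihood estimate θ̂ = n_m/n used on the next slide.

```python
def block_likelihood(n, n_m, theta, M=4):
    """P(y^n | x^n; theta) = theta^{n_m} * ((1 - theta)/(M - 1))^{n - n_m}."""
    psi = 1.0 - theta
    return theta ** n_m * (psi / (M - 1)) ** (n - n_m)

# The likelihood is maximized at theta-hat = n_m / n:
n, n_m = 48, 40
best = block_likelihood(n, n_m, n_m / n)
assert all(block_likelihood(n, n_m, t) <= best for t in (0.5, 0.7, 0.9, 0.99))
```

This is exactly why only n_m needs to be transmitted before the mask: the positions of the matches carry no information under this memoryless model.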
23 The NML Model for Discrete Regression 5
Since θ̂(y^n, x^n) = n_m/n, the maximized likelihood is
$$P(y^n \mid x^n; \hat\theta(y^n, x^n)) = \left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n - n_m}{n(M-1)}\right)^{n - n_m}.$$
For normalization, use only blocks similar enough to x^n:
$$\hat P(y^n \mid x^n) = \frac{\left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n - n_m}{n(M-1)}\right)^{n - n_m}}{\sum_{m \in \Lambda_n} \binom{n}{m} (M-1)^{n-m} \left(\frac{m}{n}\right)^{m} \left(\frac{n - m}{n(M-1)}\right)^{n - m}}.$$
24 The NML Model for Discrete Regression 6
The set Λ_n = {N(w,n), ..., n} is computed from
$$N(w,n) = \min\{\, n_m : L_{\mathrm{NML}}(n_m, N(w,n)) + \log_2 w + 1 < 2n \,\}.$$
Introducing
$$C_{n,N} = \sum_{m=N}^{n} \binom{n}{m} \left(\frac{m}{n}\right)^{m} \left(\frac{n - m}{n}\right)^{n - m},$$
the NML code length is
$$L_{\mathrm{NML}}(n_m, N) = \log_2 C_{n,N} - n_m \log_2 \frac{n_m}{n} - (n - n_m) \log_2 \frac{n - n_m}{n} + (n - n_m) \log_2 (M - 1).$$
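Assuming the normalizer C_{n,N} and code length L_NML as defined on this slide, a direct numeric evaluation (my own sketch) shows how steeply the cost falls as the match count grows, which is the behavior plotted in Figure 2:

```python
import math

def C_nN(n, N):
    """Constrained normalizer: sum over the allowed match counts m = N..n."""
    return sum(math.comb(n, m) * (m / n) ** m * ((n - m) / n) ** (n - m)
               for m in range(N, n + 1))

def L_NML(n_m, n, N, M=4):
    """NML code length in bits for a block of n symbols with n_m matches."""
    bits = math.log2(C_nN(n, N))
    if n_m > 0:
        bits -= n_m * math.log2(n_m / n)
    if n_m < n:
        bits -= (n - n_m) * math.log2((n - n_m) / n)
        bits += (n - n_m) * math.log2(M - 1)
    return bits

# With n = 48 and N = 30, a near-perfect match is far cheaper than the
# 2n = 96-bit clear representation:
print(round(L_NML(30, 48, 30), 1), round(L_NML(40, 48, 30), 1), round(L_NML(48, 48, 30), 1))
```

Constraining the normalization to m ≥ N shrinks C_{n,N}, which lowers the code length for the good matches the encoder actually uses.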
25 The NML Model for Discrete Regression 7
Figure 2: NML code length as a function of the number of matching bases n_m, when Y contains all possible blocks (unconstrained normalization) and when Y_{x^n} contains only the blocks with a number of correct matches larger than N = 30, for n = 48; the cost of the clear representation is shown for comparison.
26 The NML Model for Discrete Regression 8
Encoding a block y^n with the NML model goes as follows:
➀ Find the best regressor x^n
➁ Encode the position and direction (normal or palindrome) of x^n
➂ Encode the binary mask b^n, where b_i = χ(y_i = x_i)
➃ Correct the non-matching characters indicated by b^n
27 The NML Model for Discrete Regression 9
For finding the best regressor:
- Both normal and palindrome matches are considered
- The regressor must lie fully inside the window of w past symbols preceding the current block
- The mask b^n is required to contain a contiguous run of k matching symbols (a seed); this requirement is used to speed up the search
- Increasing k makes the search much faster, with little loss in compression
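The seed requirement can be exploited with a k-mer index over the window, so that only positions sharing an exact k-mer with the block are ever scored. This is my own sketch of such a filter, not the paper's data structures:

```python
from collections import defaultdict

def build_seed_index(window, k):
    """Map every k-mer of the search window to its starting offsets."""
    index = defaultdict(list)
    for i in range(len(window) - k + 1):
        index[window[i:i + k]].append(i)
    return index

def candidate_positions(block, window_len, index, k):
    """Window offsets whose regressor shares an exact k-mer (seed) with the block."""
    hits, last = set(), window_len - len(block)
    for j in range(len(block) - k + 1):
        for i in index.get(block[j:j + k], ()):
            pos = i - j                  # align the shared seed
            if 0 <= pos <= last:         # regressor must lie fully in the window
                hits.add(pos)
    return hits

window = "ACGTACGTTTGCACGT"
index = build_seed_index(window, 4)
print(sorted(candidate_positions("ACGTACGA", len(window), index, 4)))  # -> [0, 4]
```

The trade-off the slide describes follows directly: a longer seed shrinks the candidate set (faster search) but may miss good approximate regressors whose matches never run k symbols in a row.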
28 The NML Model for Discrete Regression 10
Figure 3: Acceleration of the search using seeds of length r, when the block size is n = 48; the exact formula A(n, r) is shown together with the lower bound D(n, r).
29 The NML Model for Discrete Regression 11
Figure 4: Relative reduction in compression ratio, (E[L_{I2}] - E[L_{I1}]) / E[L_{I1}], when using seeds of length r, against the case of exhaustive search, when the block size is n = 48. Circular marks show P(miss | n_m) obtained on a random sequence, but with probabilities P(n_m) collected from the DNA sequence HUMGHCSA. Triangular marks show the change in performance of GeNML on the same file.
30 The NML Model for Discrete Regression 12
Next, the position and direction of the best match are encoded:
- Normally this takes up log_2 min{(l-1)n + 1, w} + 1 bits
- Long approximate matches are efficiently coded with match prediction
- Long exact matches are coded with run lengths
31 The NML Model for Discrete Regression 13
The binary mask b^n can be encoded in two steps:
➀ The number of matching bases n_m is encoded according to
$$P(n_m) = \sum_{b^n : \sum_i b_i = n_m} \hat P(b^n) = \frac{\binom{n}{n_m} \left(\frac{n_m}{n}\right)^{n_m} \left(\frac{n - n_m}{n}\right)^{n - n_m}}{C_{n,N}}.$$
➁ The binary mask b^n is encoded bit-wise with the distribution
$$P(b_k = 0) = \frac{n - k - n(k)}{n - k}, \qquad P(b_k = 1) = \frac{n(k)}{n - k}, \qquad \text{where } n(k) = \sum_{j=k}^{n-1} b_j.$$
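The bit-wise distribution in step ➁ is enumerative coding: once n_m is known, coding each bit with P(b_k = 1) = n(k)/(n - k), where n(k) counts the ones still to come, spends exactly log2 C(n, n_m) bits in total. A small check (my own sketch):

```python
import math

def mask_bits(b):
    """Total bits when coding mask b with P(b_k = 1) = n(k)/(n - k)."""
    n, total = len(b), 0.0
    for k, bit in enumerate(b):
        ones_left = sum(b[k:])           # n(k) = sum_{j >= k} b_j
        p1 = ones_left / (n - k)
        total += -math.log2(p1 if bit else 1.0 - p1)
    return total

b = [1, 1, 1, 0, 1, 1, 1, 0]             # n = 8, n_m = 6
assert abs(mask_bits(b) - math.log2(math.comb(8, 6))) < 1e-9
```

Intuitively, the decoder tracks how many ones remain among the unseen positions, so every mask with the same n_m gets the same, minimal description length.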
32 The NML Model for Discrete Regression 14
The overall code length for NML at position ln, with window size w, is
$$L_1(y^n) = \log_2 C_{n,N} - n_m \log_2 \frac{n_m}{n} - (n - n_m) \log_2 \frac{n - n_m}{n} + (n - n_m) \log_2 (M - 1) + \log_2 \min\{(l-1)n + 1, w\} + 1.$$
33 The GeNML Algorithm
G. Korodi and I. Tabus, "An efficient normalized maximum likelihood algorithm for DNA sequence compression," ACM Trans. on Information Systems, vol. 23, no. 1, pp. 3-34, 2005.
- Improved compression efficiency
- Practical resource requirements
- A complex model of several algorithms, each with several instances
34 The GeNML Algorithm 2
A context coder serves as an auxiliary coder that complements NML in performance:
- NML cannot capture redundancy concentrated in a small area
- For these blocks an order-1 context coder is used
The parameters for this coder are set as
$$\eta(a_k \mid a_j) = \frac{n(a_k a_j)}{\sum_a n(a\,a_j)}.$$
The overall code length for the context coder is
$$L_2(y^n) = -\sum_{i=1}^{n} \log_2 \eta(y_i \mid y_{i-1}).$$
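A sketch of the order-1 context cost, under my own assumptions: the bigram counts n(a_k a_j) are gathered from already-processed data, and an unseen context falls back to the uniform 1/4 (a real coder would smooth the counts instead):

```python
import math
from collections import Counter

def bigram_model(history):
    """Build eta(a_k | a_j) from bigram counts over already-seen data."""
    counts = Counter(zip(history, history[1:]))
    def eta(curr, prev):
        denom = sum(counts[(prev, a)] for a in "ACGT")
        return counts[(prev, curr)] / denom if denom else 0.25
    return eta

def context_code_length(block, prev_symbol, eta):
    """L_2(y^n) = -sum_i log2 eta(y_i | y_{i-1})."""
    bits, prev = 0.0, prev_symbol
    for c in block:
        p = eta(c, prev)
        bits += -math.log2(p) if p > 0 else float("inf")
        prev = c
    return bits

eta = bigram_model("ACAC" * 50)                  # a highly regular history
print(context_code_length("ACACAC", "C", eta))   # -> 0.0: fully predicted
```

On such locally repetitive stretches the context cost collapses toward zero, which is exactly the small-area redundancy NML misses.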
35 The GeNML Algorithm 3
DNA data often appear statistically random. Since such parts cannot be compressed, clear (uncompressed) encoding is used for them. The code length of the clear representation is
$$L_3(y^n) = 2n.$$
36 The GeNML Algorithm 4
The GeNML algorithm is outlined as follows:
- The DNA sequence is split into macroblocks
- One instance each of the NML, context, and clear coders forms a group
- Objects in different groups have different parameters
- The best group is selected to compress the next macroblock
- Inside the macroblock, compression is done block-wise
- The best algorithm of the group compresses the next block
37 The GeNML Algorithm 5
Step 1. Set parameters n_0, H_0, δ, C. Let m = δ^{C-1} n_0.
Step 2. For each macroblock M_k = s_{km}, ..., s_{(k+1)m-1}:
  Step 2.1. Let n = n_0, H = H_0.
  Step 2.2. Let L_n = 0.
  Step 2.3. For each block y^n in M_k:
    Step 2.3.1. Compute L_1, L_2, L_3.
    Step 2.3.2. Let L_n = L_n + min{L_1, L_2, L_3}.
  Step 2.4. If n < m, then let n = δn, H = H/δ, and go to Step 2.2; else proceed to the next step.
  Step 2.5. Find n_b for which L_{n_b} = min_n {L_n}; let H_b = nH/n_b. Signal n_b in the compressed stream.
  Step 2.6. For each block y^{n_b} in M_k:
    Step 2.6.1. Repeat Step 2.3.1.
    Step 2.6.2. Signal the best algorithm found in the previous step and encode y^{n_b}.
Figure 5: The specification of the GeNML algorithm.
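The core of the selection logic can be condensed into a few lines. This is my own sketch with simplified stand-in cost functions (the real L_1 and L_2 are the NML and context code lengths above); it shows how each block takes the cheapest coder and how the per-macroblock totals decide the block size n_b:

```python
def best_block_size(macroblock, coders, block_sizes):
    """coders: dict name -> cost function f(block) returning bits."""
    totals = {}
    for n in block_sizes:
        total = 0.0
        for start in range(0, len(macroblock), n):
            block = macroblock[start:start + n]
            total += min(cost(block) for cost in coders.values())  # Step 2.3
        totals[n] = total
    return min(totals, key=totals.get), totals                     # Step 2.5

coders = {
    "clear": lambda b: 2 * len(b),        # L_3 = 2n
    "toy":   lambda b: 8 + 0.5 * len(b),  # hypothetical stand-in for L_1/L_2
}
n_b, totals = best_block_size("A" * 96, coders, [16, 48])
print(n_b, totals)  # 48 {16: 96.0, 48: 64.0}
```

Larger blocks amortize each coder's fixed per-block overhead (here the toy 8-bit header), which is why the search over block sizes in Steps 2.2 to 2.4 pays off.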
38 The GeNML Algorithm 6
Figure 6: (a) The number of times the clear representation, the order-1 context coder, and NML with n = 48 prove to be the best. (b) The number of times the best match is used, the predicted match is used, and the best match is predicted.
39 Results
Table 1: Comparison of the compression (in bits per base) obtained by the algorithms Biocompress-2 (Bio2), GenCompress-2 (Gen2), CTW-LZ (CTW), DNACompress (DNA), and GeNML on the sequences CHMPXX, CHNTXX, HEHCMVCG, HUMDYSTROP, HUMGHCSA, HUMHDABCD, HUMHPRTB, MPOMTCG, MTPACG, and VACCG.
40 Results 2
Table 2: Comparison of the compression (in bits per base) obtained by the algorithms Cfact, GenCompress-2, and GeNML on the sequences Atatsgs, Atef1a, Atrdnaf, Atrdnai, Celk07e, HSG6PDGEN, Mmzp3g, and Xlxfg.
41 Human Genome Compression
The program GeNML is suitable for compressing the entire human genome:
- Original size: 3,070,521,116 bases (732 Mbytes at 2 bits/base)
- Number of specified bases: 2,832,183,...
- GeNML window size: 1 Mbyte
- GeNML seed length: 8
- Compressed size: 589,323,192 bytes
- Compression ratio for the specified bases: ... bpb
CSEP 52 Applied Algorithms Spring 25 Statistical Lossless Data Compression Outline for Tonight Basic Concepts in Data Compression Entropy Prefix codes Huffman Coding Arithmetic Coding Run Length Coding
More informationSequence analysis and Genomics
Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute
More informationMolecular Evolution & the Origin of Variation
Molecular Evolution & the Origin of Variation What Is Molecular Evolution? Molecular evolution differs from phenotypic evolution in that mutations and genetic drift are much more important determinants
More informationMolecular Evolution & the Origin of Variation
Molecular Evolution & the Origin of Variation What Is Molecular Evolution? Molecular evolution differs from phenotypic evolution in that mutations and genetic drift are much more important determinants
More informationHuffman Coding. C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University
Huffman Coding C.M. Liu Perceptual Lab, College of Computer Science National Chiao-Tung University http://www.csie.nctu.edu.tw/~cmliu/courses/compression/ Office: EC538 (03)573877 cmliu@cs.nctu.edu.tw
More informationLecture 3 : Algorithms for source coding. September 30, 2016
Lecture 3 : Algorithms for source coding September 30, 2016 Outline 1. Huffman code ; proof of optimality ; 2. Coding with intervals : Shannon-Fano-Elias code and Shannon code ; 3. Arithmetic coding. 1/39
More informationComputation Theory Finite Automata
Computation Theory Dept. of Computing ITT Dublin October 14, 2010 Computation Theory I 1 We would like a model that captures the general nature of computation Consider two simple problems: 2 Design a program
More informationAsymptotic redundancy and prolixity
Asymptotic redundancy and prolixity Yuval Dagan, Yuval Filmus, and Shay Moran April 6, 2017 Abstract Gallager (1978) considered the worst-case redundancy of Huffman codes as the maximum probability tends
More informationIntermittent Communication
Intermittent Communication Mostafa Khoshnevisan, Student Member, IEEE, and J. Nicholas Laneman, Senior Member, IEEE arxiv:32.42v2 [cs.it] 7 Mar 207 Abstract We formulate a model for intermittent communication
More informationMAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK UNIT V PART-A. 1. What is binary symmetric channel (AUC DEC 2006)
MAHALAKSHMI ENGINEERING COLLEGE-TRICHY QUESTION BANK SATELLITE COMMUNICATION DEPT./SEM.:ECE/VIII UNIT V PART-A 1. What is binary symmetric channel (AUC DEC 2006) 2. Define information rate? (AUC DEC 2007)
More informationChapter 2 Source Models and Entropy. Any information-generating process can be viewed as. computer program in executed form: binary 0
Part II Information Theory Concepts Chapter 2 Source Models and Entropy Any information-generating process can be viewed as a source: { emitting a sequence of symbols { symbols from a nite alphabet text:
More informationReducing storage requirements for biological sequence comparison
Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,
More informationMutual information content of homologous DNA sequences
Mutual information content of homologous DNA sequences 55 Mutual information content of homologous DNA sequences Helena Cristina G. Leitão, Luciana S. Pessôa and Jorge Stolfi Instituto de Computação, Universidade
More informationBio nformatics. Lecture 3. Saad Mneimneh
Bio nformatics Lecture 3 Sequencing As before, DNA is cut into small ( 0.4KB) fragments and a clone library is formed. Biological experiments allow to read a certain number of these short fragments per
More information"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky
MOLECULAR PHYLOGENY "Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky EVOLUTION - theory that groups of organisms change over time so that descendeants differ structurally
More informationPiecewise Constant Prediction
Piecewise Constant Prediction Erik Ordentlich Information heory Research Hewlett-Packard Laboratories Palo Alto, CA 94304 Email: erik.ordentlich@hp.com Marcelo J. Weinberger Information heory Research
More informationMultimedia. Multimedia Data Compression (Lossless Compression Algorithms)
Course Code 005636 (Fall 2017) Multimedia Multimedia Data Compression (Lossless Compression Algorithms) Prof. S. M. Riazul Islam, Dept. of Computer Engineering, Sejong University, Korea E-mail: riaz@sejong.ac.kr
More informationEntropy Coding. Connectivity coding. Entropy coding. Definitions. Lossles coder. Input: a set of symbols Output: bitstream. Idea
Connectivity coding Entropy Coding dd 7, dd 6, dd 7, dd 5,... TG output... CRRRLSLECRRE Entropy coder output Connectivity data Edgebreaker output Digital Geometry Processing - Spring 8, Technion Digital
More informationECE533 Digital Image Processing. Embedded Zerotree Wavelet Image Codec
University of Wisconsin Madison Electrical Computer Engineering ECE533 Digital Image Processing Embedded Zerotree Wavelet Image Codec Team members Hongyu Sun Yi Zhang December 12, 2003 Table of Contents
More informationSelective Use Of Multiple Entropy Models In Audio Coding
Selective Use Of Multiple Entropy Models In Audio Coding Sanjeev Mehrotra, Wei-ge Chen Microsoft Corporation One Microsoft Way, Redmond, WA 98052 {sanjeevm,wchen}@microsoft.com Abstract The use of multiple
More informationData Structures in Java
Data Structures in Java Lecture 20: Algorithm Design Techniques 12/2/2015 Daniel Bauer 1 Algorithms and Problem Solving Purpose of algorithms: find solutions to problems. Data Structures provide ways of
More informationChapter 2: Source coding
Chapter 2: meghdadi@ensil.unilim.fr University of Limoges Chapter 2: Entropy of Markov Source Chapter 2: Entropy of Markov Source Markov model for information sources Given the present, the future is independent
More informationEvolution of Genotype-Phenotype mapping in a von Neumann Self-reproduction within the Platform of Tierra
Evolution of Genotype-Phenotype mapping in a von Neumann Self-reproduction within the Platform of Tierra Declan Baugh and Barry Mc Mullin The Rince Institute, Dublin City University, Ireland declan.baugh2@mail.dcu.ie,
More informationSara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)
Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline
More informationMAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK. SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A
MAHALAKSHMI ENGINEERING COLLEGE QUESTION BANK DEPARTMENT: ECE SEMESTER: IV SUBJECT CODE / Name: EC2252 COMMUNICATION THEORY UNIT-V INFORMATION THEORY PART-A 1. What is binary symmetric channel (AUC DEC
More informationExample: sending one bit of information across noisy channel. Effects of the noise: flip the bit with probability p.
Lecture 20 Page 1 Lecture 20 Quantum error correction Classical error correction Modern computers: failure rate is below one error in 10 17 operations Data transmission and storage (file transfers, cell
More information+ (50% contribution by each member)
Image Coding using EZW and QM coder ECE 533 Project Report Ahuja, Alok + Singh, Aarti + + (50% contribution by each member) Abstract This project involves Matlab implementation of the Embedded Zerotree
More informationLinear-Space Alignment
Linear-Space Alignment Subsequences and Substrings Definition A string x is a substring of a string x, if x = ux v for some prefix string u and suffix string v (similarly, x = x i x j, for some 1 i j x
More informationCommunications Theory and Engineering
Communications Theory and Engineering Master's Degree in Electronic Engineering Sapienza University of Rome A.A. 2018-2019 AEP Asymptotic Equipartition Property AEP In information theory, the analog of
More informationChapter 2 Date Compression: Source Coding. 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code
Chapter 2 Date Compression: Source Coding 2.1 An Introduction to Source Coding 2.2 Optimal Source Codes 2.3 Huffman Code 2.1 An Introduction to Source Coding Source coding can be seen as an efficient way
More informationObjective: Reduction of data redundancy. Coding redundancy Interpixel redundancy Psychovisual redundancy Fall LIST 2
Image Compression Objective: Reduction of data redundancy Coding redundancy Interpixel redundancy Psychovisual redundancy 20-Fall LIST 2 Method: Coding Redundancy Variable-Length Coding Interpixel Redundancy
More informationA Comparison of Methods for Redundancy Reduction in Recurrence Time Coding
1 1 A Comparison of Methods for Redundancy Reduction in Recurrence Time Coding Hidetoshi Yokoo, Member, IEEE Abstract Recurrence time of a symbol in a string is defined as the number of symbols that have
More informationThe Complete Set Of Genetic Instructions In An Organism's Chromosomes Is Called The
The Complete Set Of Genetic Instructions In An Organism's Chromosomes Is Called The What is a genome? A genome is an organism's complete set of genetic instructions. Single strands of DNA are coiled up
More informationDigital Communications III (ECE 154C) Introduction to Coding and Information Theory
Digital Communications III (ECE 154C) Introduction to Coding and Information Theory Tara Javidi These lecture notes were originally developed by late Prof. J. K. Wolf. UC San Diego Spring 2014 1 / 8 I
More information