Introduction to Information Theory Part 3

Assignment #1 Results
List the text(s) used, the total number of letters, and the computed entropy of each text. Compare results. What is the computed average word length of 3-letter codes? Compare results.

A General Communication System (review)
[Block diagram: Information Source → Transmitter → Channel → Receiver → Destination]

Information Source (review)
Definition of information: the information associated with an event/message of probability $p$ is $I(p) = \log_2 \frac{1}{p}$. Entropy is the average information over all possible outcomes: given probabilities $p_1, p_2, \dots, p_n$, $H = \sum_{k=1}^{n} p_k \log_2 \frac{1}{p_k}$.
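As a quick illustration of the review formula, here is a minimal Python sketch (the function name and the example distributions are mine, not from the slides) that computes $H$ from a list of probabilities:

```python
import math

def entropy(probs):
    """H = sum_k p_k * log2(1 / p_k), in bits."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit (fair coin)
print(entropy([0.9, 0.1]))   # ~0.469 bits (biased coin: less surprise on average)
```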

Source Coding: Basics (review)
Block code: all codewords have the same length; for example, ASCII (8 bits). Average word length: for a block code it is simply the common codeword length; more generally, $L = \sum_{k=1}^{n} p_k \, l_k$, where $l_k$ is the length of the codeword for symbol $k$. A code is efficient if it has the smallest average word length (it turns out entropy is the benchmark).
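A corresponding sketch for the average word length (the probabilities and codeword lengths below are made up for illustration):

```python
def average_word_length(probs, lengths):
    """L = sum_k p_k * l_k, in code bits per source symbol (for a binary code)."""
    return sum(p * l for p, l in zip(probs, lengths))

probs   = [0.5, 0.25, 0.125, 0.125]
lengths = [1, 2, 3, 3]                 # e.g. the codewords 0, 10, 110, 111
print(average_word_length(probs, lengths))   # 1.75
# The entropy of this source is also 1.75 bits, so this code meets the lower bound.
```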

Average Code Length & Entropy (review)
Average length bounds: $H \le L < H + 1$. Grouping symbols together (coding the $n$-th extension of the source): $nH \le L_n < nH + 1$, so $H \le \frac{L_n}{n} < H + \frac{1}{n}$ and $\lim_{n \to \infty} \frac{L_n}{n} = H$.
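A worked example (not from the slides) makes the bounds concrete: for a binary source with probabilities 0.9 and 0.1, $H \approx 0.469$ bits/symbol, yet any code for single symbols needs at least 1 bit/symbol. Coding pairs of symbols (probabilities 0.81, 0.09, 0.09, 0.01) with a Huffman code gives codeword lengths 1, 2, 3, 3 and $L_2 = 1.29$ bits per pair, i.e. $L_2/2 = 0.645$ bits/symbol, already much closer to $H$; longer blocks approach $H$ in the limit.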

Shannon's First Theorem (review)
By coding sequences of $n$ independent symbols (the $n$-th extension of the source), it is possible to construct codes whose average length per source symbol comes arbitrarily close to the entropy $H$. The price paid for such improvement is increased coding complexity (due to the increased block length $n$) and increased delay in coding.

Entropy & Coding: Central Ideas (review)
- Use short codes for highly likely events: this shortens the average length of coded messages.
- Code several events at a time: this provides greater flexibility in code design.
- Shannon's Source Coding Theorem. Algorithms. Applications.

Source Coding (review)
- Efficient codes
- Singular codes
- Nonsingular codes
- Instantaneous codes
- Universal codes: codes that do not require knowledge of the probability distribution but act in the limit as if they had that knowledge.

Huffman Codes (review)
- Nonsingular
- Instantaneous
- Efficient
- Not unique
- Coding powers (extensions) of a source brings the average length per symbol closer to $H$
- Requires knowledge of the symbol probabilities
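A minimal Huffman construction sketch in Python (using heapq; the symbol probabilities are illustrative, and because ties can be broken either way the resulting code is one of several equally good codes, as the slide notes):

```python
import heapq

def huffman_code(prob_by_symbol):
    """Build a binary Huffman code from a {symbol: probability} dict."""
    # Each heap entry: (total probability, tie-breaker, {symbol: codeword-so-far}).
    heap = [(p, i, {s: ""}) for i, (s, p) in enumerate(prob_by_symbol.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)   # the two least likely groups...
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (p1 + p2, tie, merged))   # ...are merged into one
        tie += 1
    return heap[0][2]

print(huffman_code({"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}))
# {'A': '0', 'B': '10', 'C': '110', 'D': '111'} -- average length 1.75 = H
```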

Entropy & Coding: Central Ideas (review)
- Use short codes for highly likely events: this shortens the average length of coded messages.
- Code several events at a time: this provides greater flexibility in code design.
- Shannon's Source Coding Theorem. Algorithms: Huffman encoding. Applications: compression.

xkcd #936: Password Strength (comic)

Some Observations
Successive symbols from a source are not always independent. E.g., in English, "h" is more likely to occur following a "t". This intersymbol dependency must be accounted for in an accurate measure of entropy.

Lossless Compression: English Text
How much can we compress a given text? What is an accurate measure of the entropy of English text?
Zeroth-order entropy (uniform over a 27-symbol alphabet of 26 letters plus space): $\log_2 27 \approx 4.755$ bits/letter.
First-order entropy: about 3.3 bits/letter.
Second-order entropy: about 3.1 bits/letter.
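One way to see the gap below the zeroth-order figure is to measure single-letter frequencies in a sample text. This sketch is an illustration only: the sample string and the restriction to the 27-symbol alphabet (a-z plus space) are assumptions, not part of the slides.

```python
from collections import Counter
import math

def first_order_entropy(text):
    """Estimate bits/letter from single-letter frequencies over a-z and space."""
    letters = [c for c in text.lower() if ("a" <= c <= "z") or c == " "]
    counts = Counter(letters)
    n = len(letters)
    return sum((c / n) * math.log2(n / c) for c in counts.values())

print(math.log2(27))                          # zeroth order: ~4.755 bits/letter
print(first_order_entropy("the quick brown fox jumps over the lazy dog " * 50))
# Lower than 4.755, because letters are not equally likely; real English text,
# once digram statistics (second order) are included, is lower still.
```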

Zipf's Law, 1949
$f_r \sim 1/r^s$, where $f_r$ is the frequency of occurrence of the $r$-th ranked item and $s$ is close to 1. The most frequent word occurs approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. For example, in a corpus of over 1 million words (the Brown Corpus):
  the: 69,971 occurrences (about 7%)
  of: 36,411 occurrences (about 3.5%)
  and: 28,852 occurrences (about 2.9%)
For English text: $p_r \approx A/r$, where $r$ is the word's rank, $p_r$ is the probability of the rank-$r$ word, and $A$ is a constant that depends on the number of active words in the language.

Zipf's Law (figure)

Estimating Entropy of English Text
For English text with $n$ active words, take $p_r = A/r$, where $r$ is the word's rank, $p_r$ is the probability of the rank-$r$ word, and $A$ is a constant that depends on the number of active words in the language. If $A = 0.1$, so that $p_r = 0.1/r$, then for the probabilities to sum to 1 we need $n = 12{,}366$ words. The entropy is then
$H = \sum_{r=1}^{12366} \frac{0.1}{r} \log_2 \frac{r}{0.1} \approx 9.72$ bits/word.
At an average of 4.5 letters per word, this is $9.72 / 4.5 \approx 2.16$ bits/letter.
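The arithmetic can be checked directly; this sketch just reproduces the slide's estimate (the cut-off of 12,366 words is the point where the probabilities sum to roughly 1):

```python
import math

n = 12366                                    # number of "active" words for A = 0.1
probs = [0.1 / r for r in range(1, n + 1)]   # Zipf: p_r = 0.1 / r
print(sum(probs))                            # ~1.0, so the cut-off is consistent
H_word = sum(p * math.log2(1.0 / p) for p in probs)
print(H_word)                                # ~9.7 bits/word
print(H_word / 4.5)                          # ~2.16 bits/letter at 4.5 letters/word
```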

The Long Tail of Moby Dick (figure)

Zipf's Law and Populations (figure)

Shannon Redundancy
$\text{Redundancy} = 1 - \frac{H}{\log_2 |A|}$, where $H$ is the per-letter entropy and $|A|$ is the size of the source alphabet. Thus the redundancy of English is $1 - \frac{2.16}{\log_2 27} \approx 54.6\%$. With an entropy of 1.5 bits/letter we get about 67% redundancy. That is, Huffman coding (even at an entropy of 3.3 or 3.1 bits/letter) will not get close to the theoretical limit. Can we achieve compression down to about 33% of the original size?
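The redundancy figure is a one-line calculation; this small sketch (function name is mine) plugs in the slide's numbers:

```python
import math

def redundancy(bits_per_letter, alphabet_size):
    """Shannon redundancy: 1 - H / log2(|alphabet|)."""
    return 1.0 - bits_per_letter / math.log2(alphabet_size)

print(redundancy(2.16, 27))   # ~0.546, i.e. English text is roughly 55% redundant
```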

Lempel-Ziv Coding
- Sequences of text repeat patterns (words, phrases, etc.)
- Construct a dictionary of common patterns
- Send references to patterns as triples

Lempel-Ziv Coding (LZ77)
Message: THIS THESIS IS THE THESIS.
The encoder steps through the message with a search buffer over the already-coded text and a look-ahead buffer over the text still to be coded. At each step it finds the longest match to the look-ahead buffer inside the search buffer and emits a triple (offset, length, next symbol); ␣ marks a space:

  Offset  Length  Next symbol  Text produced (match + next symbol)
  0       0       T            T
  0       0       H            H
  0       0       I            I
  0       0       S            S
  0       0       ␣            ␣
  5       2       E            THE
  5       1       I            SI
  7       2       I            S␣I
  10      5       ␣            S␣THE␣
  14      6       .            THESIS.
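A compact greedy encoder along these lines reproduces the table above. This is a teaching sketch, not the exact algorithm used by any particular tool: it assumes an unbounded search buffer, always takes the longest match, keeps matches inside the already-encoded text, and makes no attempt to be fast.

```python
def lz77_encode(text):
    """Emit (offset, length, next_symbol) triples; offset counts back from the
    current position, and (0, 0, c) means a literal with no match."""
    i, triples = 0, []
    while i < len(text):
        best_len, best_off = 0, 0
        for j in range(i):                       # candidate match starts in the search buffer
            k = 0
            while (i + k < len(text) - 1 and j + k < i
                   and text[j + k] == text[i + k]):
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        triples.append((best_off, best_len, text[i + best_len]))
        i += best_len + 1
    return triples

print(lz77_encode("THIS THESIS IS THE THESIS."))
# [(0,0,'T'), (0,0,'H'), (0,0,'I'), (0,0,'S'), (0,0,' '),
#  (5,2,'E'), (5,1,'I'), (7,2,'I'), (10,5,' '), (14,6,'.')]
```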

Lempel-Ziv Coding
- Sequences of text repeat patterns (words, phrases, etc.); construct a dictionary of common patterns; send references to patterns as triples (offset, length, next symbol). E.g. an offset of 5 and a length of 3 means: go back 5 received characters, take the next 3 from there, and add them to the end.
- The sizes of the search buffer and look-ahead buffer are finite.
- Used by ZIP, PKZip, LHarc, PNG, gzip, ARJ.
- Extended to LZ78 (which uses an explicit dictionary) and LZW (+ Terry Welch).
- Achieves the optimal rate of transmission in the long run without using the probability distribution.

Decode
Decoding applies the same rule: go back offset characters, copy length characters from there, then append the next symbol (␣ marks a space):

  Offset  Length  Next symbol  Decoded so far
  0       0       I            I
  0       0       ␣            I␣
  0       0       M            I␣M
  3       1       S            I␣MIS
  1       1       ␣            I␣MISS␣
  5       5       L            I␣MISS␣MISS␣L
  5       3       Y            I␣MISS␣MISS␣LISSY

Decoded message: I MISS MISS LISSY
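Decoding only needs the rule stated above. A minimal sketch (the function name is mine), applied to the triples from this exercise:

```python
def lz77_decode(triples):
    """Rebuild the text from (offset, length, next_symbol) triples."""
    out = []
    for offset, length, nxt in triples:
        start = len(out) - offset
        for k in range(length):
            out.append(out[start + k])   # copy character by character (handles overlaps)
        out.append(nxt)
    return "".join(out)

triples = [(0, 0, "I"), (0, 0, " "), (0, 0, "M"), (3, 1, "S"),
           (1, 1, " "), (5, 5, "L"), (5, 3, "Y")]
print(lz77_decode(triples))   # I MISS MISS LISSY
```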

References
- Eugene Chiu, Jocelyn Lin, Brok Mcferron, Noshirwan Petigara, Satwiksai Seshasai: Mathematical Theory of Claude Shannon: A study of the style and context of his work up to the genesis of information theory. MIT 6.933J / STS.420J, The Structure of Engineering Revolutions.
- Luciano Floridi: Information: A Very Short Introduction, Oxford University Press, 2010.
- Luciano Floridi: The Philosophy of Information, Oxford University Press, 2011.
- James Gleick: The Information: A History, A Theory, A Flood, Pantheon Books, 2011.
- Zhandong Liu, Santosh S. Venkatesh and Carlo C. Maley: Sequence space coverage, entropy of genomes and the potential to detect non-human DNA in human samples, BMC Genomics, 9:509, 2008.
- David Luenberger: Information Science, Princeton University Press, 2006.
- David J. C. MacKay: Information Theory, Inference, and Learning Algorithms, Cambridge University Press, 2003.
- Claude Shannon & Warren Weaver: The Mathematical Theory of Communication, University of Illinois Press, 1949.
- W. N. Francis and H. Kucera: Brown University Standard Corpus of Present-Day American English, Brown University, 1967.
- Edward L. Glaeser: A Tale of Many Cities, New York Times, April 10, 2010. Available at: http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/
- Alan Rimm-Kaufman: The Long Tail of Search, Search Engine Land, September 18, 2007. Available at: http://searchengineland.com/the-long-tail-of-search-12198