Information Theory (Information Theory by J. V. Stone, 2015)


Claude Shannon (1916-2001). Shannon, C. (1948). A mathematical theory of communication. Bell System Technical Journal, 27:379-423. Shannon gave a mathematical definition of information and showed how much information can be communicated between different elements of a system. Whether we consider computers, evolution, physics, artificial intelligence, quantum computation, or the brain, their behaviors are largely determined by the way they process information. Information is a well-defined and measurable quantity, as important as mass and velocity in describing the universe. Shannon's theory ranks alongside those of Darwin and Wallace, Newton, and Einstein.

Information vs. Data and Signal vs. Noise. Information (the useful data, or signal) is embedded in data, which consists of signal plus noise (note: signal-to-noise ratio, SNR). 1 bit is the amount of information required to choose between two equally probable alternatives (note: a bit of information is not the same thing as a binary digit). n bits of information correspond to m = 2^n equally probable alternatives: if you have n bits of information, you can choose from m = 2^n equally probable alternatives. Conversely, m equally probable alternatives carry n = log2(m) bits of information: if you have to choose between m equally probable alternatives, you need n = log2(m) bits of information.
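
As a minimal illustration of these two formulas (my own sketch, assuming the base-2 convention used throughout), the following Python snippet converts between a number of equally probable alternatives m and the corresponding number of bits n:

```python
import math

def bits_needed(m: int) -> float:
    """Information (in bits) needed to pick one of m equally probable alternatives."""
    return math.log2(m)

def alternatives(n: float) -> float:
    """Number of equally probable alternatives that n bits can distinguish."""
    return 2 ** n

print(bits_needed(8))      # 3.0 bits for 8 alternatives
print(alternatives(20))    # 1048576.0 alternatives for 20 bits
```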

Finding a Route, Bit by Bit. 1 bit is the amount of information required to choose between two equally probable alternatives. How much information is needed to arrive correctly at point D when you have no information about the route? What is the meaning (or usefulness) of the first bit of information? What is the meaning (or usefulness) of the second bit of information? What if you already knew you should turn right? What is the meaning (or usefulness) of the third bit of information? What if you already knew that there was a 71% probability that the right turn is correct?

A Million Answers to Twenty Questions. Toy problem with three questions: Is it inanimate? Is it a mammal? Is it a cat? Does the ordering of the questions matter? With 20 questions, how many words can you distinguish? Note 2^20 = 1,048,576, about 10^6. How about 40 questions? Note 2^40 = 1,099,511,627,776, about 10^12. If you made 40 correct binary turns to drive from Seoul to Budang, you have avoided arriving at 1,099,511,627,775 incorrect destinations. 1 bit is the amount of information required to choose between two equally probable alternatives.

Telegraphy Electric current produces magnetic field. 26 pairs of electric lines or 23 or 1 (how about bidirectional telegraphy?) Morse code for efficient use of a single channel Short codewords for most common letters E, T, A, N, M, etc. Longer codewords for less common letters Z, Q, J, X, Y, etc. 6

Image and Pixels. An image is a 2D array of pixels (picture elements). Each pixel is represented as a number expressing luminance, color, X-ray absorption, proton density, temperature, or some other quantity. An image is therefore a 2D array of numbers, which can be reformatted as a long 1D array of numbers. The number representing a pixel is a value chosen from a range such as [0, 255]. If a pixel takes one value out of m equally probable values, the pixel conveys n = log2(m) bits of information.

Binary (B&W) Image. Most images have predictable internal structure and therefore contain redundant information. Encoding method A: code 1 binary digit (0 or 1) per pixel for 100 x 100 pixels (10,000 bits of information?). Encoding method B: code the locations of the white pixels. Encoding method C: code the number of black pixels before each white pixel. Encoding method D (run-length coding): code the number of pixels preceding each change from 0 to 1 or from 1 to 0. All of these methods can be lossless. What is the amount of information in each of these images?
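
As an illustration of method D, here is a minimal run-length coding sketch in Python (an illustration of this note, not code from the book): a binary image is flattened to a 1D sequence of 0s and 1s, and only the starting value plus the lengths of consecutive runs are stored.

```python
from itertools import groupby

def rle_encode(pixels):
    """Run-length encode a 1D binary sequence: (first value, list of run lengths)."""
    runs = [sum(1 for _ in group) for _, group in groupby(pixels)]
    return pixels[0], runs

def rle_decode(first_value, runs):
    """Invert rle_encode: reproduce the original binary sequence exactly (lossless)."""
    out, value = [], first_value
    for length in runs:
        out.extend([value] * length)
        value = 1 - value
    return out

row = [0, 0, 0, 1, 1, 0, 0, 0, 0, 1]
first, runs = rle_encode(row)      # first = 0, runs = [3, 2, 4, 1]
assert rle_decode(first, runs) == row
```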

8-bit Gray-Scale Image. Direct coding: 100 x 100 x 8 = 80,000 binary digits, with pixel values in [0, 255]. Difference coding: 100 x 100 x 7 = 70,000 binary digits (a 12.5% reduction), with differences in [-63, 63]. How much information does each pixel contain? The smallest number of binary digits required to represent each pixel is equal to the amount of information (measured in bits) implicit in each pixel. The average information per pixel turns out to be 5.92 bits, so the image can be compressed without loss of information.
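
A minimal difference-coding sketch (my own illustration, assuming 1D data for simplicity): store the first pixel value directly and then only the difference between each pixel and its predecessor. Because neighbouring pixels tend to be similar, the differences occupy a much narrower range than the raw values.

```python
def diff_encode(pixels):
    """Store the first value, then successive differences."""
    return [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]

def diff_decode(codes):
    """Cumulatively sum the differences to recover the original values exactly."""
    out = [codes[0]]
    for d in codes[1:]:
        out.append(out[-1] + d)
    return out

row = [100, 102, 101, 105, 110, 110]
codes = diff_encode(row)           # [100, 2, -1, 4, 5, 0]
assert diff_decode(codes) == row   # lossless
```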

Data Compression and Difference Coding (figure): 126 million photoreceptors, 1 million nerve fibers.

Encoding and Decoding in the Human Vision System (figure).

Efficient Coding Hypothesis and Evolution. The evolution of sense organs, and of the brains that process data from those organs, is primarily driven by the need to minimize the energy expended for each bit of information acquired from the environment. The ability to separate signal from noise is fundamental to the Darwin-Wallace theory of evolution by natural selection. Evolution works by selecting the individuals best suited to a particular environment so that, over many generations, information about the environment gradually accumulates within the gene pool. Thus, natural selection is essentially a means by which information about the environment is incorporated into DNA (deoxyribonucleic acid).

Television and Teleaudition? (figure: acoustic signal).

Information Processing System. The ability to separate signal from noise, that is, to extract information from data, is crucial for all information processing systems: television and all modern devices and systems, as well as the human nervous system and all biological systems.

Television and Broadcasting System (figure).

Human Nervous System (figure; Malmivuo and Plonsey, 1995).

Information Processing System (block diagram): at the source (transmitter), data s enter an encoder, which produces the channel input x = g(s); the channel, subject to noise, produces an output y; a decoder at the receiver recovers the data. Key concepts: probability theory, entropy, information theory.

Outcome, Sample Space, Outcome Value, Random Variable, and Probability. Experiment: coin flip; outcome = head; random variable X; mapping X(head) = 1. Experiment: coin flip; outcome = tail; random variable X; mapping X(tail) = 0. Outcome: head (x_h) or tail (x_t). Sample space: A_X = {x_h, x_t}. Outcome value: 1 or 0. Random variable: X(x_h) = 1 and X(x_t) = 0. Probability distribution: p(X) = {p(X = x_h), p(X = x_t)} = {p(x_h), p(x_t)}.

Probability Distributions (figures): fair coin, biased coin, fair die with 8 sides.

Random Variable. A random variable X is a function that maps each outcome x of an experiment (e.g. a coin flip) to a number X(x), which is the outcome value of x. If the outcome value of x is 1, this may be written as X = 1 or x = 1. Examples for a 6-sided die include X = i if x is face i (for i = 1, ..., 6), or X = 0 if x is 1, 3, or 5 and X = 1 if x is 2, 4, or 6. The random variable X is associated with its probability distribution p(X) = {p(x_1), p(x_2), ..., p(x_k)} and sample space A_X = {x_1, x_2, ..., x_k}.
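
A small sketch of the second example (the parity of a die roll), assuming a fair die; the random variable is simply a mapping from outcomes to values, together with a probability distribution over the sample space:

```python
from fractions import Fraction

sample_space = [1, 2, 3, 4, 5, 6]                        # outcomes of a fair 6-sided die
X = {x: 0 if x % 2 == 1 else 1 for x in sample_space}    # X(x) = 0 for odd, 1 for even
p_outcome = {x: Fraction(1, 6) for x in sample_space}    # uniform distribution

# Induced distribution of the random variable X
p_X = {}
for x, value in X.items():
    p_X[value] = p_X.get(value, Fraction(0)) + p_outcome[x]

print(p_X)   # {0: Fraction(1, 2), 1: Fraction(1, 2)}
```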

Information Processing System (block diagram, as before): data s → encoder x = g(s) → channel input x → channel with noise → output y → decoder → data at the receiver.

Source and Message. A message is an ordered sequence of k symbols: s = (s_1, s_2, ..., s_k). Each symbol s_i is an outcome (value) of a random variable S with sample space A_S = {s_1, s_2, ..., s_α}. The random variable S is associated with the probability distribution p(S) = {p(s_1), p(s_2), ..., p(s_α)}, where Σ_{i=1}^{α} p(s_i) = 1.

Encoder and Channel Input. An encoder transforms the message s into a channel input x = g(s). The channel input x = (x_1, x_2, ..., x_n) is an ordered sequence of codewords x_i. Each codeword x_i is an outcome (value) of a random variable X with sample space A_X = {x_1, x_2, ..., x_m}, which is the codebook. The random variable X is associated with the probability distribution p(X) = {p(x_1), p(x_2), ..., p(x_m)}, where Σ_{i=1}^{m} p(x_i) = 1. The probability of a codeword x_i is p(x_i).

Code. A code is a list of symbols and their corresponding codewords. A code can be envisaged as a look-up table such as:

Symbol      Codeword
s_1 = 3     x_1 = 000
s_2 = 6     x_2 = 001
s_3 = 9     x_3 = 010
s_4 = 12    x_4 = 011
s_5 = 15    x_5 = 100
s_6 = 18    x_6 = 101
s_7 = 21    x_7 = 110
s_8 = 24    x_8 = 111
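
A minimal sketch of this look-up-table view of a code (a Python dictionary is assumed here purely for illustration): encoding maps each symbol to its codeword, and decoding inverts the table.

```python
# Codebook from the table above: symbol -> 3-binary-digit codeword
codebook = {3: "000", 6: "001", 9: "010", 12: "011",
            15: "100", 18: "101", 21: "110", 24: "111"}
inverse = {codeword: symbol for symbol, codeword in codebook.items()}

def encode(message):
    """Map a sequence of symbols to a string of codewords."""
    return "".join(codebook[s] for s in message)

def decode(bits):
    """Split the received string into 3-digit codewords and look each one up."""
    return [inverse[bits[i:i + 3]] for i in range(0, len(bits), 3)]

x = encode([3, 24, 9])     # "000111010"
assert decode(x) == [3, 24, 9]
```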

Channel and Channel Output. A channel produces a channel output y = (y_1, y_2, ..., y_n), which is an ordered sequence of n values y_i. Each channel output value y_i is an outcome (value) of a random variable Y with sample space A_Y = {y_1, y_2, ..., y_m}. The random variable Y is associated with the probability distribution p(Y) = {p(y_1), p(y_2), ..., p(y_m)}, where Σ_{i=1}^{m} p(y_i) = 1. The probability of the channel output value y_i is p(y_i).

Decoder and Received Message. A decoder transforms the channel output y into a received message r = g^{-1}(y) using the code. The received message may contain an error, r = g^{-1}(x + η) = s + e, due to channel noise. The error rate of a code is the number of incorrect inputs associated with the codebook divided by the number of possible inputs.

Channel Capacity. Channel capacity is the maximum amount of information that can be communicated through a channel. Channel capacity is measured in terms of information per symbol (bits/symbol); if a channel communicates n symbols per second, its capacity can be expressed in terms of information per second (bits/s). If an alphabet of a symbols is transmitted through a noiseless channel at n symbols/s, the channel capacity is n log2(a) bits/s. The rate of information transmission through a given channel is less than the channel capacity because of noise and/or code inefficiency; the channel capacity is the maximum rate that can be achieved when considered over all possible codes.
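
A one-line check of the noiseless-channel formula (my own sketch; the symbol rate and alphabet size below are arbitrary assumptions): an alphabet of a symbols carries log2(a) bits per symbol, so at n symbols/s the capacity is n log2(a) bits/s.

```python
import math

def noiseless_capacity(symbols_per_second: float, alphabet_size: int) -> float:
    """Capacity of a noiseless channel in bits per second: n * log2(a)."""
    return symbols_per_second * math.log2(alphabet_size)

print(noiseless_capacity(100, 8))   # 300.0 bits/s: 100 symbols/s * 3 bits/symbol
```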

Channel Capacity (block diagram): data s → encoder x = g(s) → channel input x → channel with noise → output y → decoder → data.

Shannon's Desiderata: Properties of Information. Continuity: the amount of information associated with an outcome increases or decreases continuously (smoothly) as the probability of the outcome changes. Symmetry: the amount of information associated with a sequence of outcomes does not depend on the order of the sequence. Maximal value: the amount of information associated with a set of outcomes cannot be increased if those outcomes are already equally probable. Additivity: the information associated with a set of independent outcomes is obtained by adding the information of the individual outcomes. There is only one definition of information that satisfies all four of these properties.

Information as Surprise. The Shannon information, or surprisal, of an outcome (value) x with probability p(x) is h(x) = log2(1/p(x)) = -log2 p(x) bits. Shannon information is a measure of surprise.
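
A minimal sketch of the surprisal formula (assuming base-2 logarithms, as in the text): improbable outcomes carry more information than probable ones.

```python
import math

def surprisal(p: float) -> float:
    """Shannon information h(x) = -log2 p(x), in bits."""
    return -math.log2(p)

print(surprisal(0.5))     # 1.0 bit  (fair coin flip)
print(surprisal(0.125))   # 3.0 bits (a 1-in-8 outcome)
```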

Entropy is Average Shannon Information. Entropy is the average Shannon information (surprisal) of a random variable X with probability distribution p(X) = {p(x_1), p(x_2), ..., p(x_m)}: H(X) = Σ_{i=1}^{m} p(x_i) log2(1/p(x_i)) = E[h(x)] bits. For a long sequence of n observed outcomes, the average surprisal approaches the entropy: H(X) ≈ (1/n) Σ_{i=1}^{n} log2(1/p(x_i)) = (1/n) Σ_{i=1}^{n} h(x_i) bits. The entropy of a random variable with equally probable outcome values is the logarithm of the number of those values: H(X) = log2(m) bits, i.e., m = 2^{H(X)} equally probable values.
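
A short sketch of the entropy formula (using the distributions from the next slide, a fair coin and a fair 8-sided die, as test cases):

```python
import math

def entropy(probabilities) -> float:
    """H(X) = sum of p(x) * log2(1/p(x)), in bits; terms with p = 0 contribute nothing."""
    return sum(p * math.log2(1.0 / p) for p in probabilities if p > 0)

print(entropy([0.5, 0.5]))        # 1.0 bit   (fair coin)
print(entropy([1/8] * 8))         # 3.0 bits  (fair 8-sided die)
print(entropy([0.9, 0.1]))        # ~0.469 bits (biased coin: less uncertainty)
```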

Entropy is a Measure of Uncertainty. The average uncertainty of a variable X is summarized by its entropy H(X). If we are told the value of X, the amount of information we get is, on average, exactly equal to its entropy. For a fair coin, H(X) = 1 bit; for a fair die with 8 sides, H(X) = 3 bits.

Amount of Data in HDTV Images. Satellite TV: 1920 (H) x 1080 (V) x 3 (R, G, B) = 6,220,800 pixel values. 256 values per pixel means log2(256) = 8 bits/pixel, so 6,220,800 x 8 = 49,766,400 bits/image, about 50 million bits/image. A frame rate of 30 images/s produces about 1,500 megabits/s, but the satellite channel capacity is only 19.2 megabits/s, so an effective compression factor of about 78 (1,500/19.2) is required. Remedies: squeeze the redundant data out of the images spatially and temporally (Moving Picture Experts Group, MPEG, using the cosine transform); remove components invisible to the human eye (filtering; high resolution for intensity, low resolution for color); recode the resultant data so that all symbols occur equally often; add a small amount of redundancy for error correction. How about MP3 (MPEG-1 or MPEG-2 Audio Layer III) and JPEG (Joint Photographic Experts Group)?
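
A quick arithmetic check of these numbers (a sketch; the 19.2 megabit/s channel figure is taken from the slide):

```python
import math

values_per_frame = 1920 * 1080 * 3                    # 6,220,800 R, G, B values
bits_per_value = math.log2(256)                       # 8 bits per value
bits_per_frame = values_per_frame * bits_per_value    # 49,766,400 bits
raw_rate = bits_per_frame * 30 / 1e6                  # ~1,493 megabits/s at 30 frames/s
channel_capacity = 19.2                               # megabits/s (satellite channel)
print(raw_rate, raw_rate / channel_capacity)          # ~1493 Mbit/s, compression factor ~78
```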

Evolution by Natural Selection. Evolution is a process in which natural selection acts as a mechanism for transferring information from the environment to the collective genome of a species. Each individual effectively answers the question: are the genes in this individual better or worse than average? The answer comes in units called fitness, measured by the number of offspring that survive to adulthood. Over many generations, information about the environment becomes implicit in the genome of a species. The human genome contains about 3 x 10^9 (3 billion) pairs of nucleotides; each nucleotide comprises one element in one half of the double helix of the DNA molecule. The human genome makes use of four nucleotides: adenine (A), guanine (G), thymine (T), and cytosine (C). One may define fitness as the proportion of good genes above 50%.

Does Sex Accelerate Evolution? (MacKay, 2003). The mutation rate of a genome of N nucleotides is the probability that each gene will be altered from one generation to the next (the average proportion of altered genes). For a sexual population, the largest tolerable mutation rate is about 1/√N; for an asexual population, it is about 1/N. For a sexually reproducing population, the rate at which a genome of N nucleotides accumulates information from the environment can be as large as √N bits/generation. For N = 3 x 10^9, this rate is √N ≈ 54,772 bits/generation: the collective genome of the current generation would have about 54,772 bits more information about its environment than the genomes of the previous generation. For an asexually reproducing population, the rate at which the genome accumulates information from the environment is about 1 bit/generation.
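
A quick check of the square-root figures above (a sketch using the genome size quoted on the slide):

```python
import math

N = 3e9                              # nucleotides in the human genome
print(math.sqrt(N))                  # ~54,772 bits/generation (sexual population)
print(1 / math.sqrt(N), 1 / N)       # largest mutation rates: sexual vs. asexual
```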

Does Sex Accelerate Evolution? (MacKay, 2003).

                        Sexual                 Asexual             Ratio (sexual/asexual)
New information         √N bits/generation     1 bit/generation    √N
Largest mutation rate   1/√N                   1/N                 √N

A large genome increases evolutionary speed, but it also decreases tolerance to mutations: if N is large, evolution is fast but mutations must be rare; if N is small, evolution is slow but mutations can be frequent. Evolution should have found a compromise value of N. The collective genome of a species should maximize the Shannon information acquired about its environment for each joule of expended energy. This efficient evolution hypothesis is an extension of the efficient coding hypothesis.

The Human Genome: How Much Information? The alphabet of DNA consists of the four letters A, G, T, and C. For a message with N letters, the number of possible messages is m = 4^N. The maximum amount of information conveyed by such a message is H = log2(4^N) = N log2(4) = 2N bits. For the human genome with N = 3 x 10^9 nucleotides, H = 6 x 10^9 bits, i.e., 6 billion bits of information, stored in the DNA inside the nucleus of every cell.
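
The same calculation as a sketch (assuming, as the slide does, that all 4^N sequences are equally probable, which gives an upper bound):

```python
import math

alphabet_size = 4                                 # A, G, T, C
N = 3_000_000_000                                 # nucleotides in the human genome
bits_per_letter = math.log2(alphabet_size)        # 2 bits per nucleotide
max_information = N * bits_per_letter             # upper bound: 6e9 bits
print(bits_per_letter, max_information)           # 2.0, 6,000,000,000 bits
```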

Enough DNA to Wire Up a Brain? Neurons are the only connection between us and the physical world. 10^11 neurons/brain x 10^4 synapses/neuron = 10^15 synapses/brain, but there are only 3 x 10^9 nucleotides in the human genome. Even if each nucleotide specified one synapse, the genome could encode only about one millionth of all synapses, and that without using any DNA for the rest of the body. Therefore, there is not enough DNA in the human genome to specify every synaptic connection in the brain: the human brain must learn! Learning provides a way for a brain to use information from the environment to specify the correct set of all 10^15 synapses.

John Wheeler, 1986: "Everything must be based on a simple idea. And this idea, once we have finally discovered it, will be so compelling, so beautiful, that we will say to one another, yes, how could it have been any different?" The universe is made of information; matter and energy are only incidental.
