Information and Entropy. Professor Kevin Gold



What's Information? Informally, when I communicate a message to you, that's information: "Your grade is 100/100." Information can be encoded as a signal: words on the page, a sound wave in the air, bits on a hard drive, in RAM, or in an Internet packet.

Same Information, More or Less Signal. I can be more or less wordy while communicating the same information: "Your grade on the midterm is 100/100." / "Your grade is 100/100" / "100" / "+". The number of possible messages constrains how short the message can be. I can get away with the shortest code, "+", only if there are few possible grades (+, -, and so on). If there's no context, I need to preface by specifying "on the midterm" or the meaning is ambiguous.

Using Fixed-Length Binary Codes. In addition to using bits to represent numbers in binary, we can use bits to represent arbitrary messages. Suppose we have N messages we'd like to send. We can assign binary numbers to represent these messages: 000 = Hello, 001 = Goodbye, 010 = Send Help, 011 = I'm fine, 100 = Thanks, 101 = I'm trapped in a computer. So 000101010001 = 000 101 010 001. How many bits do we need? Enough to have a code for each message. We get 2^b codes with b bits, so we need ⌈log2 N⌉ bits for N messages. More possible messages requires more bits. 8 bits = up to 2^8 = 256 messages.
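As a quick sketch of the counting above (a hypothetical helper in Python, not part of the lecture), the following assigns fixed-length binary codes to N messages using ⌈log2 N⌉ bits each, with the message list taken from the slide:

import math

# Fixed-length codes: with b bits we get 2^b codes, so N messages need ceil(log2 N) bits.
messages = ["Hello", "Goodbye", "Send Help", "I'm fine", "Thanks",
            "I'm trapped in a computer"]

bits = math.ceil(math.log2(len(messages)))           # 3 bits suffice for 6 messages
codes = {m: format(i, f"0{bits}b") for i, m in enumerate(messages)}

print(bits)                                           # 3
print(codes["Hello"], codes["Send Help"])             # 000 010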

Sample Fixed-Length Codes: ASCII and the Unicode BMP. ASCII is still often used to encode characters at a flat rate of 8 bits per character: 01000001 = A, 01000010 = B, 00100100 = $, 00100101 = %, 00110000 = 0, 00110001 = 1, ... Unicode can use more than 16 bits if necessary, but the codes in the 16-bit basic multilingual plane (BMP) can handle a wide variety of international characters, as well as things like mathematical notation and emoji. [Table: hex for Unicode codes (x varies by column)]

Why Fixed-Length? Avoiding Ambiguity. Fixed-length codes are a convenient way to know where one code ends and another begins. 0100001001000001 = "BA" in ASCII; we know the cut is at 8 bits. Suppose our codes from the previous example went: 0 = Hello, 1 = Goodbye, 10 = Send Help, 11 = I'm fine, 100 = Thanks, 101 = I'm trapped in a computer. Is 0101 "Hello, Goodbye, Hello, Goodbye" or "Hello, I'm trapped in a computer"? We can't tell here.

An Advantage of Variable-Length Codes: Compression. Codes like ASCII assign the same number of bits regardless of whether the character is common ('e', 01100101) or uncommon ('~', 01111110). But we can assign variable-length codes to symbols, where more common symbols get shorter codes, and end up sending fewer bits overall. Example: compress AAAAAAAABDCB. Code 1: 00 = A, 01 = B, 10 = C, 11 = D. Bitstring is 00 00 00 00 00 00 00 00 01 11 10 01 (24 bits). Code 2: 0 = A, 10 = B, 110 = C, 111 = D. Bitstring is 0 0 0 0 0 0 0 0 10 111 110 10 (18 bits).
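To make the savings concrete, here is a small Python check (an illustrative sketch, not from the slides) that encodes the same string under both codes and counts the bits:

message = "AAAAAAAABDCB"

code1 = {"A": "00", "B": "01", "C": "10", "D": "11"}   # fixed-length: 2 bits per symbol
code2 = {"A": "0", "B": "10", "C": "110", "D": "111"}  # variable-length: common symbols get shorter codes

bits1 = "".join(code1[ch] for ch in message)
bits2 = "".join(code2[ch] for ch in message)

print(len(bits1), len(bits2))   # 24 18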

The Prefix Property. Ambiguity arises when a decoder doesn't know whether to accept a code as done or keep reading more. But we can arrange for no code to be the beginning (prefix) of any other code. This is called the prefix property. Codes that obey the prefix property don't need spaces or separating symbols to be sent between codes to understand what is being sent. Decoding 000000001011111010 with 0 = A, 10 = B, 110 = C, 111 = D (which obeys the prefix property): 0 is the start of no symbol but A, so the first character is A. The same logic applies to the next 7 symbols - they must each be A. Read 1: it could still be 3 codes. Read 0: 10 is a unique start, so it's B. [etc.]

Morse Code Doesn't Have the Prefix Property. Invented before we understood codes well, Morse code is ambiguous without pauses between symbols (compare "A", dot-dash, to "ET", dot then dash). On a computer, an extra symbol or sequence to separate characters would waste space; here, the pause wasted time.

Codes that Obey the Prefix Property Can Be Modeled With Trees. [Tree diagram: edges labeled 0 and 1, leaves A, B, C, D for the code 0 = A, 10 = B, 110 = C, 111 = D.] Decoding 000000001011111010: Left -> A (x8); Right, Left -> B; Right, Right, Right -> D; [etc.] Every variable-length code that obeys the prefix property can be modeled as a binary tree (a tree with a root node and at most two children per node). Edges are labeled 0 or 1, and the leaves are labeled with the symbols to encode. Decode each character by following labeled branches from the root. This must obey the prefix property as long as symbols are only at leaves.
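A minimal Python sketch of this decoding idea (the helper names build_tree and decode are my own, not from the lecture): build the binary tree from the codes, then walk left on 0 and right on 1 until a leaf is reached.

def build_tree(codes):
    # Turn {symbol: bitstring} into nested dicts acting as a binary tree;
    # interior nodes are dicts keyed by "0"/"1", leaves are the symbols themselves.
    root = {}
    for symbol, bits in codes.items():
        node = root
        for b in bits[:-1]:
            node = node.setdefault(b, {})
        node[bits[-1]] = symbol
    return root

def decode(bitstring, codes):
    root = build_tree(codes)
    node, out = root, []
    for b in bitstring:
        node = node[b]
        if isinstance(node, str):   # reached a leaf: emit the symbol, restart at the root
            out.append(node)
            node = root
    return "".join(out)

codes = {"A": "0", "B": "10", "C": "110", "D": "111"}
print(decode("000000001011111010", codes))   # AAAAAAAABDCB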

Optimal Codes Require a Number of Bits Related to Their Probability. Compression is optimal if it uses as few bits as possible to represent the same message. Variable-length codes such as the one we just showed can be optimal as long as they are organized in the right way according to character frequency. There is an algorithm to do this: Huffman coding, which automatically creates a tree from character counts. There is an interesting property of the number of bits used per symbol in this compression - it can't be less than -log2 p, where p is the probability (frequency) of the character.

Huffman Coding Example. We have a file with: 16 a's, 4 b's, 4 c's, 2 d's, 1 e, 1 f, 4 g's (32 total). First, create 1 node per character (these are our trees): a:16, b:4, c:4, d:2, e:1, f:1, g:4.

Huffman Coding Example. Now, join the two trees with the smallest counts: e:1 and f:1 become a tree with count 2. Remaining trees: a:16, b:4, c:4, d:2, (e,f):2, g:4.

Huffman Coding Example. Join the two trees with the smallest counts, and repeat: d:2 and (e,f):2 become a tree with count 4. Remaining trees: a:16, b:4, c:4, (d,(e,f)):4, g:4.

Huffman Coding Example. Join the two trees with the smallest counts, and repeat: two of the count-4 trees merge into a tree with count 8.

Huffman Coding Example. Join the two trees with the smallest counts, and repeat: the other two count-4 trees merge as well, leaving a:16 and two trees of count 8, (b,c) and (g,(d,(e,f))).

Huffman Coding Example. The two count-8 trees merge into a tree with count 16. Remaining trees: a:16 and ((b,c),(g,(d,(e,f)))):16.

Huffman Coding Example. Finally, a:16 joins the other count-16 tree, giving a single tree with count 32 - the whole file.

Huffman Coding Example. Label each left edge 0 and each right edge 1. [Tree diagram: root 32 splits into a:16 (edge 0) and an internal node of count 16 (edge 1); that node splits into the (b,c) subtree of count 8 and the (g,(d,(e,f))) subtree of count 8, down to the leaves e:1 and f:1.]

Huffman Coding Example. Reading the edge labels from the root to each leaf gives the codes: a = 0, b = 100, c = 101, d = 1110, e = 11110, f = 11111, g = 110.
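The merging procedure above is easy to express with a priority queue. The Python sketch below is an assumed implementation consistent with the slides' procedure, not code from the lecture: it builds the tree with heapq and reads the codes off it. With this particular tie-breaking it happens to reproduce the codes on the slide; other tie-breaks would give different 0/1 labels but the same code lengths.

import heapq

def huffman_codes(counts):
    # Heap entries are (count, tie_breaker, tree); a tree is either a symbol
    # or a (left, right) pair of subtrees.
    heap = [(c, i, sym) for i, (sym, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, t1 = heapq.heappop(heap)   # pop the two smallest counts...
        c2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (c1 + c2, next_id, (t1, t2)))   # ...and join them into one tree
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")   # left edge labeled 0
            walk(tree[1], prefix + "1")   # right edge labeled 1
        else:
            codes[tree] = prefix          # leaf: record the accumulated bits
    walk(heap[0][2], "")
    return codes

counts = {"a": 16, "b": 4, "c": 4, "d": 2, "e": 1, "f": 1, "g": 4}
print(huffman_codes(counts))
# {'a': '0', 'b': '100', 'c': '101', 'g': '110', 'd': '1110', 'e': '11110', 'f': '11111'}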

Example of Optimal Symbol Compression. We have a file with: 16 a's, 4 b's, 4 c's, 2 d's, 1 e, 1 f, 4 g's (32 total). Pr(a) = 16/32 = 1/2, -log2(1/2) = -(-1) = 1 bit. Pr(b) = 4/32 = 1/8, -log2(1/8) = -(-3) = 3 bits. Pr(c) = 4/32 = 1/8, -log2(1/8) = -(-3) = 3 bits. Pr(d) = 2/32 = 1/16, -log2(1/16) = -(-4) = 4 bits. Pr(e) = 1/32, -log2(1/32) = -(-5) = 5 bits. Pr(f) = 1/32, -log2(1/32) = -(-5) = 5 bits. Pr(g) = 4/32 = 1/8, -log2(1/8) = -(-3) = 3 bits.

Example of Optimal Symbol Compression. [The same table, shown next to the Huffman tree.] Every code length equals -log2 p for its symbol: a = 0 (1 bit), b = 100, c = 101, g = 110 (3 bits each), d = 1110 (4 bits), e = 11110, f = 11111 (5 bits each).

MPEG Coding Uses Variable-Length Codes. To compress video and audio, the MPEG standard gives variable-length codes for features such as brightness, amount of movement in part of the scene, and texture. Some values are more common than others, and the more common ones get shorter codes: flat textures (fewer bits) vs. busy textures (more bits); no movement (fewer bits) vs. movement (more bits); no change from the previous video frame (no bits) vs. change (more bits). [Image: ffmpeg debug output showing motion vectors]

JPEG is similar, leading to predictable differences in file size: 17,67 bytes for the image with flat or regular textures vs. 322,17 bytes for the image with unpredictable texture (both 1024x768 JPEGs).

Why Not Use Variable-Length Codes (VLCs)? Though VLCs save memory, there's no way to get random access to a code down the line in constant time. Like linked lists, it takes linear time to access a code - by decoding all the codes before it. To get random access with fixed-length codes, we can use the same trick as arrays - multiply the index by the code length to get the place to look for a particular code. For this reason, VLCs tend to just be used for compression, and files are decompressed in a single linear-time pass before being used. Most common data types have fixed lengths, as speed of access is more important than memory use.
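A tiny illustration of the random-access point (my own example bitstream, not from the slides): with 8-bit fixed-length codes, the i-th character starts at bit 8*i, so we can jump straight to it.

bitstream = "010000100100000101000011"   # "BAC" in 8-bit ASCII
CODE_LEN = 8

def nth_char(bits, i):
    chunk = bits[i * CODE_LEN:(i + 1) * CODE_LEN]   # array-style indexing: offset = index * code length
    return chr(int(chunk, 2))

print(nth_char(bitstream, 2))   # 'C', without decoding the codes before it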

Getting Used to -log2 p. If N symbols are equally likely, they each have probability p = 1/N. There's nothing to be gained from variable-length codes in this case - it's all symmetric - so we'd assign the same number of bits to everything. We determined that we need ⌈log2 N⌉ bits to represent all N codes. If we drop the ceiling and replace N with 1/p, we have log2(1/p) = log2(p^-1) = -log2 p. Thus, plugging in probabilities that correspond to equally likely outcomes will produce a reasonable number of bits: -log2(1/2) = 1 bit for a coin flip; -log2(1/4) = 2 bits for 4 possible values, 00, 01, 10, 11; -log2(1/256) = 8 bits, and so on.
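These values are quick to check in Python (a throwaway snippet, not from the lecture):

import math

for p in (1/2, 1/4, 1/256):
    print(p, -math.log2(p))   # 1.0, 2.0, and 8.0 bits respectively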

A Mathematical Definition of Information. The -log2 p bound on the number of bits doesn't depend on what kind of thing we're talking about - it's something we can calculate about any event with a probability. For any event, we can ask, "How many bits should we use to talk about this event, in an optimal encoding?" This quantity is interesting because it is smaller when events are unsurprising (high p) and larger when events are surprising (low p). This matches our everyday use of the idea of being "informative" closely enough that this quantity gained the name "the information" of an event. If p is the probability of an event, -log2 p is its information. Even outside a computer context, information is measured in bits.

In This Sense, the Image on the Right Has More Information. [The same two images as before: 17,67 bytes vs. 322,17 bytes.] "Information" here doesn't imply that the message is interesting - just that it had low probability and therefore takes more bits to encode. Because the image on the right is more unpredictable, it takes more bits to reproduce it exactly. If a source of information is always extremely unpredictable, we say it has high entropy, to be defined more formally next.

Entropy. In information theory, entropy is the expected information for a source of symbols. Entropy is E[I(X)], where X is an event and I(X) = -log2 Pr(X) is its information. Applying the definition of expectation, this is Σ_X -Pr(X) log2 Pr(X). It simultaneously represents: how many bits we need to use on average to encode the stream of symbols, and how unpredictable each symbol is, on average.
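This definition translates directly into a few lines of Python (a sketch with an assumed helper name, entropy); the values it prints match the examples on the following slides.

import math

def entropy(probs):
    # H = sum over outcomes X of -Pr(X) * log2 Pr(X); zero-probability terms contribute 0.
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([1/2, 1/2]))            # 1.0 bit  (fair coin)
print(entropy([1/4] * 4))             # 2.0 bits (fair 4-sided die)
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75 bits (the unfair die coming up)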

Entropy Examples. Series of coin flips: ABABAABBAB. Pr(A) = 1/2, Pr(B) = 1/2. Entropy = -1/2 log2(1/2) - 1/2 log2(1/2) = -1/2(-1) - 1/2(-1) = 1/2 + 1/2 = 1 bit. (A = 0, B = 1) Series of rolls of a 4-sided die: 1,2,4,1,3,2,4,3. Pr(1) = Pr(2) = Pr(3) = Pr(4) = 1/4. Each term is -(1/4) log2(1/4) = 1/4 (2) = 1/2, so 1/2 + 1/2 + 1/2 + 1/2 = 2 bits (00, 01, 10, 11).

Entropy Examples. Pr("I am Groot") = 1: -1 log2(1) = -1(0) = 0 bits. (Although the entropy could be higher if we considered that Groot could speak or not.) Unfair 4-sided die: rolls 4 1/2 of the time, 3 1/4 of the time, 1 and 2 1/8 of the time each: -1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8) = -1/2(-1) - 1/4(-2) - 1/8(-3) - 1/8(-3) = 1/2 + 1/2 + 3/8 + 3/8 = 1.75 bits. Notice that this is less than the fair die; this die is more predictable.

Entropy Examples. 256 equally likely characters: Σ -1/256 log2(1/256) = log2 256 = 8 bits. If we really have no way of predicting what's coming next, we need all 8 bits every time. 256 characters where lowercase letters and digits (36 symbols) are used with equal probability 60% of the time; capital letters and 10 special characters (36 symbols) are used with equal probability 30% of the time; and the remaining characters split the remaining 10%: Pr(any particular lowercase or digit) = 0.6 * 1/36 = 0.0167. Pr(any particular uppercase or special char) = 0.3 * 1/36 = 0.0083. Pr(other) = 0.1 * 1/(256-72) = 0.1/184 = 0.0005. 36(-0.0167 log2 0.0167) + 36(-0.0083 log2 0.0083) + 184(-0.0005 log2 0.0005) = 6.69 bits under optimal encoding - not that big a deal, really.
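Reproducing that last calculation numerically (a sketch using the slide's probabilities; the small difference from 6.69 is rounding):

import math

p_lower = 0.6 / 36     # each lowercase letter or digit (36 symbols)
p_upper = 0.3 / 36     # each capital letter or special character (36 symbols)
p_other = 0.1 / 184    # each of the remaining 256 - 72 = 184 characters

H = (36 * -p_lower * math.log2(p_lower)
     + 36 * -p_upper * math.log2(p_upper)
     + 184 * -p_other * math.log2(p_other))
print(H)   # about 6.7 bits per character under an optimal encoding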

Quick Check: Entropy. Calculate the entropy for the following sequence of characters (assuming the observed frequencies reflect the underlying probabilities): ABCA. (To get you started, notice Pr(A) = 2/4 = 1/2 and that its term is -(1/2)log2(1/2) = (1/2)(1) = 1/2.)

Quick Check: Entropy. Calculate the entropy for the following sequence of characters: ABCA. Pr(A) = 1/2: -(1/2)log2(1/2) = (1/2)(1) = 1/2. Pr(B) = 1/4: -(1/4)log2(1/4) = (1/4)(2) = 1/2. Pr(C) is identical to Pr(B), so the entropy is 1/2 + 1/2 + 1/2 = 1.5 bits. Sanity check: this makes sense because the code 0, 10, 11 uses 1 bit half the time and 2 bits half the time.

Physical Entropy. Informational entropy, defined in the 1940s by Claude Shannon, gets its name from physical entropy, defined in the 19th century in physics (specifically thermodynamics). In physics, entropy refers to how unpredictable a physical system is; the equation is -k_B Σ_X Pr(X) ln Pr(X), where the X are physical events and k_B is a physical constant we don't need to discuss - you can see the resemblance to Shannon's concept. In physics, entropy is associated with heat, since higher temperature leads to higher unpredictability of the particles. If describing physical events, these quantities only differ by a constant.

Information Gain. Just as learning new information can change our perception of a probability, it can also reduce entropy. For example, we could learn that the next symbol is definitely not an A. This would reduce our surprise when we get a B or a C. When we learn information that rules out particular possibilities or changes the probability distribution, we can calculate a new entropy. The difference in entropies is called the information gain.

Information Gain Example. We're on an assembly line with appliances coming down the pipe: 1/2 dishwashers, 1/4 dryers, 1/4 stoves. The current entropy is 1/2(1) + 1/4(2) + 1/4(2) = 1.5 bits. If we recorded the sequence, it would take 1.5 bits per appliance on average with 0 = dishwasher, 10 = dryer, 11 = stove. The message comes down the line: "No more dishwashers today!" New probabilities: 1/2 dryer, 1/2 stove. The new entropy is (1/2)(1) + (1/2)(1) = 1 bit. We could encode what happens now using just 0 = dryer, 1 = stove. The information gain of "no dishwashers" is 1.5 - 1 = 0.5 bits.
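The same arithmetic in Python (reusing the entropy sketch from earlier, still an illustrative helper rather than lecture code):

import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

before = entropy([1/2, 1/4, 1/4])   # dishwashers, dryers, stoves
after = entropy([1/2, 1/2])         # after "no more dishwashers today"

print(before, after, before - after)   # 1.5  1.0  0.5 bits of information gain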

Information Gain on an Image. It's common for uncompressed images to represent pixels as triples (R,G,B) where each value is 0-255. If we treat all colors as equally likely, the entropy of the pixel values is -log2(1/(256*256*256)) = -log2(1/2^24) = 24 bits, precisely the bits needed to send each pixel. (We skipped summing over 2^24 values and dividing by 2^24; these operations cancel out.) Suppose we learn that the image is grayscale, meaning all pixels are (x, x, x) for some x in [0,255]. The new entropy is -log2(1/256) = 8 bits (again the true bits/pixel). The information gain is thus 16 bits - a measure of the savings per pixel.

Information Gain in Machine Learning. Some classifiers try to focus on features that have information gain relative to the training example classifications. Example: if classifying images as octopus or not, a "yes" is less surprising once you know it has 8 arms. Before (base rate): Octopus 0.25, Not Octopus 0.75. After splitting on the feature: "8 arms" pile: Octopus 1, Not Octopus 0; "< 8 arms" pile: Octopus 0.05, Not Octopus 0.95. Both piles have lower entropy.

Information Gain in Machine Learning. Worse features will tend to do little or nothing to reduce the entropy, and these can be ignored in the classification. Before (base rate): Octopus 0.25, Not Octopus 0.75. "Orange" pile: Octopus 0.2, Not Octopus 0.8; "not orange" pile: Octopus 0.3, Not Octopus 0.7. Little or no information gain.
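As a sketch of how a decision-tree learner would score the 8-arms feature: the slides give the class proportions in each pile but not the pile sizes, so the weight w below is an assumption, chosen to be consistent with the base rate (w*1.0 + (1-w)*0.05 = 0.25, so w is about 0.21).

import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

w = 0.2 / 0.95   # assumed fraction of examples landing in the "8 arms" pile (~0.21)

before = entropy([0.25, 0.75])                                      # base rate
after = w * entropy([1.0, 0.0]) + (1 - w) * entropy([0.05, 0.95])   # weighted entropy of the two piles

print(before, after, before - after)   # about 0.81, 0.23, and 0.59 bits gained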

Entropy and Steganography Steganography is the hiding of information in plain sight One way cyberattacks can occur (or messages transmitted) is by embedding payload bits into the low-order bits of innocuous images, videos, or PDFs Examining the least significant bits for high entropy (unpredictability) can reveal that something is not right; the payload must be compressed, creating abnormally high entropy https://securelist.com/steganography-in-contemporary-cyberattacks/79276/

Summary. Information for a particular event can be calculated directly from its probability p: -log2 p. This is the lower bound on the number of bits necessary to encode this event: 2 equiprobable events => 1 bit each; 4 equiprobable events => 2 bits. When p varies among symbols, we can take advantage of this to create variable-length codes (using Huffman coding) that use the optimal number of bits without becoming ambiguous. The entropy of a stream of symbols is the expected information - a measure of the average number of bits we need and how unpredictable or surprising the source is.