Information and Entropy Professor Kevin Gold
What's Information?
- Informally, when I communicate a message to you, that's information. "Your grade is 100/100."
- Information can be encoded as a signal: words on a page, sound waves in the air, bits on a hard drive, in RAM, or in an Internet packet.
Same Information, More or Less Signal
- I can be more or less wordy while communicating the same information: "Your grade on the midterm is 100/100." / "Your grade is 100/100" / "100" / "+"
- The number of possible messages constrains how short the message can be. I can get away with the shortest code, "+", only if there are few possible grades (say, just + and -).
- If there's no context, I need to preface by specifying "on the midterm," or the meaning is ambiguous.
Using Fixed-Length Binary Codes
- In addition to using bits to represent numbers in binary, we can use bits to represent arbitrary messages.
- Suppose we have N messages we'd like to send. We can assign binary numbers to represent these messages: 000 = Hello, 001 = Goodbye, 010 = Send Help, 011 = I'm fine, 100 = Thanks, 101 = I'm trapped in a computer. Then 000101010001 = 000 101 010 001.
- How many bits do we need? Enough to have a code for each message. We get 2^b codes with b bits, so we need ⌈log2 N⌉ bits for N messages.
- More possible messages require more bits: 8 bits gives up to 2^8 = 256 messages.
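As a quick sanity check, the bit-count rule above can be computed directly. This is a minimal sketch, not part of the slides; the function name `bits_needed` is made up for illustration.

```python
import math

# Minimal sketch: a fixed-length code over N distinct messages needs
# ceil(log2(N)) bits, since b bits give 2**b distinct codes.
def bits_needed(n_messages):
    return math.ceil(math.log2(n_messages))

print(bits_needed(6))    # the 6-message example above fits in 3 bits
print(bits_needed(256))  # 256 messages need 8 bits
```

Note that 6 messages already require 3 bits even though 3 bits could handle 8; fixed-length codes round up to the next power of two.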
Sample Fixed-Length Codes: ASCII and the Unicode BMP
- ASCII is still often used to encode characters at a flat rate of 8 bits per character: 01000001 = A, 01000010 = B, 00100100 = $, 00100101 = %, 00110000 = 0, 00110001 = 1, ...
- Unicode can use more than 16 bits if necessary, but the codes in the 16-bit Basic Multilingual Plane (BMP) can handle a wide variety of international characters, as well as things like mathematical notation and emoji.
[Figure: table of Unicode codes in hex]
Why Fixed-Length? Avoiding Ambiguity
- Fixed-length codes are a convenient way to know where one code ends and another begins: 0100001001000001 = BA in ASCII; we know the cut is at 8 bits.
- Suppose our codes from the previous example went: 0 = Hello, 1 = Goodbye, 10 = Send Help, 11 = I'm fine, 100 = Thanks, 101 = I'm trapped in a computer.
- Is 0101 "Hello, Goodbye, Hello, Goodbye" or "Hello, I'm trapped in a computer"? We can't tell here.
An Advantage of Variable-Length Codes: Compression
- Codes like ASCII assign the same number of bits regardless of whether the character is common ('e', 01100101) or uncommon ('~', 01111110).
- But we can assign variable-length codes to symbols, where more common symbols get shorter codes, and end up sending fewer bits overall.
- Example: compress AAAAAAAABDCB.
- Code 1: 00 = A, 01 = B, 10 = C, 11 = D. Bitstring is 00 00 00 00 00 00 00 00 01 11 10 01 (24 bits).
- Code 2: 0 = A, 10 = B, 110 = C, 111 = D. Bitstring is 0 0 0 0 0 0 0 0 10 111 110 10 (18 bits).
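The savings are easy to verify by encoding the example string under both codes. A small sketch (the dictionaries just transcribe the two codes from the slide):

```python
# Encode "AAAAAAAABDCB" under the two codes from the slide and
# compare the total number of bits each produces.
fixed = {'A': '00', 'B': '01', 'C': '10', 'D': '11'}
variable = {'A': '0', 'B': '10', 'C': '110', 'D': '111'}

msg = "AAAAAAAABDCB"
fixed_bits = ''.join(fixed[ch] for ch in msg)
var_bits = ''.join(variable[ch] for ch in msg)
print(len(fixed_bits))  # 24 bits under the fixed-length code
print(len(var_bits))    # 18 bits under the variable-length code
```

The variable-length code wins because the eight A's each cost only 1 bit, which more than pays for the rarer symbols costing 3 bits.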
The Prefix Property
- Ambiguity arises when a decoder doesn't know whether to accept a code as done or keep reading more.
- But we can arrange for no code to be the beginning (prefix) of any other code. This is called the prefix property.
- Codes that obey the prefix property don't need spaces or separating symbols between codes to understand what is being sent.
- Decoding 000000001011111010 with 0 = A, 10 = B, 110 = C, 111 = D (obeys the prefix property): 0 is the start of no symbol but A, so the first character is A. The same logic applies to the next seven symbols; they must each be A. Read 1: this could start any of three codes. Read 0: 10 is a unique start, so it's B. And so on.
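The greedy accept-as-soon-as-complete strategy described above can be written in a few lines. This is a minimal sketch of such a decoder, not code from the slides:

```python
# Minimal prefix-code decoder: because no code is a prefix of another,
# a buffered bit sequence that matches a code can be accepted
# immediately, with no separators needed between codes.
CODES = {'0': 'A', '10': 'B', '110': 'C', '111': 'D'}

def decode(bits):
    out, buf = [], ''
    for b in bits:
        buf += b
        if buf in CODES:          # complete code: accept it right away
            out.append(CODES[buf])
            buf = ''
    assert buf == '', "bitstring ended in the middle of a code"
    return ''.join(out)

print(decode("000000001011111010"))  # AAAAAAAABDCB
```

Note that this greedy approach only works because of the prefix property; with the ambiguous codes from the earlier slide, accepting early could be wrong.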
Morse Code Doesn't Have the Prefix Property
- Invented before we understood codes well, Morse code is ambiguous without pauses between symbols (compare A, dot-dash, to ET, dot then dash).
- On a computer, an extra symbol or sequence to denote the next character would waste space; here, the pause wasted time.
Codes that Obey the Prefix Property Can Be Modeled With Trees
[Figure: binary tree with edges labeled 0 and 1 and leaves A, B, C, D, decoding 000000001011111010: left -> A (x8); right, left -> B; right, right, right -> D; ...]
- Every variable-length code that obeys the prefix property can be modeled as a binary tree (a tree with a root node and at most two children per node).
- Edges are labeled 0 or 1, and the leaves are labeled with the symbols to encode.
- Decode each character by following labeled branches from the root.
- This must obey the prefix property as long as symbols are only at leaves.
Optimal Codes Require a Number of Bits Related to Their Probability
- Compression is optimal if it uses as few bits as possible to represent the same message.
- Variable-length codes such as the one we just showed can be optimal, as long as they are organized in the right way according to character frequency.
- There is an algorithm to do this: Huffman coding, which automatically creates a tree from character counts.
- There is an interesting property of the number of bits used per symbol in this compression: it can't be less than -log2 p, where p is the probability (frequency) of the character.
Huffman Coding Example
- We have a file with 16 a's, 4 b's, 4 c's, 2 d's, 1 e, 1 f, and 4 g's (32 total).
- First, create one node per character (these are our trees): a:16, b:4, c:4, d:2, e:1, f:1, g:4.
Huffman Coding Example
- Now, join the two trees with the smallest counts: e(1) and f(1) become a tree of weight 2.
- Remaining trees: a:16, b:4, c:4, d:2, (e,f):2, g:4.
Huffman Coding Example
- Join the two trees with the smallest counts, and repeat: d(2) and (e,f)(2) become a tree of weight 4.
- Remaining trees: a:16, b:4, c:4, (d,(e,f)):4, g:4.
Huffman Coding Example
- Repeat: g(4) and (d,(e,f))(4) become a tree of weight 8.
- Remaining trees: a:16, b:4, c:4, (g,(d,(e,f))):8.
Huffman Coding Example
- Repeat: b(4) and c(4) become a tree of weight 8.
- Remaining trees: a:16, (b,c):8, (g,(d,(e,f))):8.
Huffman Coding Example
- Repeat: the two trees of weight 8 join into a tree of weight 16 containing b, c, g, d, e, and f.
- Remaining trees: a:16 and the new tree of weight 16.
Huffman Coding Example
- Finally, a(16) joins the other tree of weight 16, giving a single tree of weight 32 (the whole file).
Huffman Coding Example
- Now label every left branch 0 and every right branch 1.
Huffman Coding Example
- Read off each character's code by following the branch labels from the root to its leaf:
a = 0, b = 100, c = 101, d = 1110, e = 11110, f = 11111, g = 110
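The merge-the-two-smallest loop above is a natural fit for a priority queue. Below is a sketch of Huffman coding with Python's `heapq`, run on the slide's counts; it is not the slides' own code, and tie-breaking among equal weights can produce a different (but equally optimal) tree, so we check the total encoded length rather than the exact codewords.

```python
import heapq

# Sketch of Huffman coding using a min-heap of trees.
def huffman_codes(counts):
    # Heap entries are (weight, tiebreak, tree); a tree is either a
    # symbol or a (left, right) pair of subtrees.
    heap = [(c, i, sym) for i, (sym, c) in enumerate(sorted(counts.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)   # two smallest trees...
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, next_id, (t1, t2)))  # ...joined
        next_id += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + '0')   # left branch labeled 0
            walk(tree[1], prefix + '1')   # right branch labeled 1
        else:
            codes[tree] = prefix or '0'   # symbols live only at leaves
    walk(heap[0][2], '')
    return codes

counts = {'a': 16, 'b': 4, 'c': 4, 'd': 2, 'e': 1, 'f': 1, 'g': 4}
codes = huffman_codes(counts)
total_bits = sum(counts[s] * len(codes[s]) for s in counts)
print(total_bits)  # 70 bits for the whole 32-character file
```

Any optimal tree for these counts encodes the file in 70 bits total, matching the code lengths read off the slide's tree (16·1 + 4·3 + 4·3 + 4·3 + 2·4 + 1·5 + 1·5 = 70).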
Example of Optimal Symbol Compression
- We have a file with 16 a's, 4 b's, 4 c's, 2 d's, 1 e, 1 f, and 4 g's (32 total).
Pr(a) = 16/32 = 1/2; -log2(1/2) = -(-1) = 1 bit
Pr(b) = 4/32 = 1/8; -log2(1/8) = -(-3) = 3 bits
Pr(c) = 4/32 = 1/8; -log2(1/8) = -(-3) = 3 bits
Pr(d) = 2/32 = 1/16; -log2(1/16) = -(-4) = 4 bits
Pr(e) = 1/32; -log2(1/32) = -(-5) = 5 bits
Pr(f) = 1/32; -log2(1/32) = -(-5) = 5 bits
Pr(g) = 4/32 = 1/8; -log2(1/8) = -(-3) = 3 bits
Example of Optimal Symbol Compression
- We have a file with 16 a's, 4 b's, 4 c's, 2 d's, 1 e, 1 f, and 4 g's (32 total).
Pr(a) = 16/32 = 1/2; -log2(1/2) = -(-1) = 1
Pr(b) = 4/32 = 1/8; -log2(1/8) = -(-3) = 3
Pr(c) = 4/32 = 1/8; -log2(1/8) = -(-3) = 3
Pr(d) = 2/32 = 1/16; -log2(1/16) = -(-4) = 4
Pr(e) = 1/32; -log2(1/32) = -(-5) = 5
Pr(f) = 1/32; -log2(1/32) = -(-5) = 5
Pr(g) = 4/32 = 1/8; -log2(1/8) = -(-3) = 3
- The Huffman codes achieve exactly these lengths: a = 0, b = 100, c = 101, d = 1110, e = 11110, f = 11111, g = 110.
MPEG Coding Uses Variable-Length Codes
- To compress video and audio, the MPEG standard gives variable-length codes for features such as brightness, amount of movement in part of the scene, and texture.
- Some values are more common than others, and the more common ones get shorter codes: flat textures (fewer bits) vs. busy textures (more bits); no movement (fewer bits) vs. movement (more bits); no change from the previous video frame (no bits) vs. change (more bits).
[Figure: ffmpeg debug output showing motion vectors]
JPEG Is Similar, Leading to Predictable Differences in File Size
- 17,67 bytes for an image with flat or regular textures vs. 322,17 bytes for one with unpredictable texture (both 1024x768 JPGs).
Why Not Use Variable-Length Codes (VLCs)?
- Though VLCs save memory, there's no way to get random access to a code down the line in constant time.
- Like linked lists, it takes linear time to access a code, by decoding all the codes before it.
- To get random access with fixed-length codes, we can use the same trick as arrays: multiply the index by the code length to find the place to look for a particular code.
- For this reason, VLCs tend to be used just for compression, and files are decompressed in a single linear-time pass before being used.
- Most common data types have fixed lengths, as speed of access is more important than memory use.
Getting Used to -log2 p
- If N symbols are equally likely, they each have probability p = 1/N.
- There's nothing to be gained from variable-length codes in this case (it's all symmetric), so we'd assign the same number of bits to everything.
- We determined that we need ⌈log2 N⌉ bits to represent all N codes. If we drop the ceiling and replace N with 1/p, we have log2(1/p) = log2 p^(-1) = -log2 p.
- Thus, plugging in probabilities that correspond to equally likely outcomes produces a reasonable number of bits: -log2(1/2) = 1 bit for a coin flip; -log2(1/4) = 2 bits for 4 possible values, 00, 01, 10, 11; -log2(1/256) = 8 bits; and so on.
A Mathematical Definition of Information
- The -log2 p bound on the number of bits doesn't depend on what kind of thing we're talking about; it's something we can calculate for any event with a probability.
- For any event, we can ask, "How many bits should we use to talk about this event, in an optimal encoding?"
- This quantity is interesting because it is smaller when events are unsurprising (high p), and larger when events are surprising (low p).
- This matches our everyday use of the idea of being "informative" closely enough that this quantity gained the name "the information" of an event.
- If p is the probability of an event, -log2 p is its information. Even outside a computer context, information is measured in bits.
In This Sense, the Image on the Right Has More Information
- "Information" here doesn't imply that the message is interesting, just that it had low probability and therefore takes more bits to encode.
- Because the image on the right is more unpredictable, it takes more bits to reproduce it exactly.
- If a source of information is always extremely unpredictable, we say it has high entropy, to be defined more formally next.
Entropy
- In information theory, entropy is the expected information for a source of symbols.
- Entropy is E[I(X)], where X is an event and I(X) = -log2 Pr(X) is its information.
- Applying the definition of expectation, this is Σ_X -Pr(X) log2 Pr(X).
- It simultaneously represents: how many bits we need to use on average to encode the stream of symbols, and how unpredictable each symbol is, on average.
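The definition translates directly into code. A minimal sketch (the `entropy` helper is made up for illustration; the zero-probability guard reflects the convention that 0·log 0 = 0):

```python
import math

# Entropy H = sum over symbols of -p * log2(p): the expected number
# of bits per symbol under an optimal encoding.
def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # fair coin: 1.0 bit
print(entropy([0.25] * 4))                # fair 4-sided die: 2.0 bits
print(entropy([0.5, 0.25, 0.125, 0.125]))  # unfair die: 1.75 bits
```

Note that the uneven distribution has lower entropy than the fair die, since it is more predictable.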
Entropy Examples
- Series of coin flips: ABABAABBAB. Pr(A) = 1/2, Pr(B) = 1/2. Entropy = -1/2 log2(1/2) - 1/2 log2(1/2) = -1/2(-1) - 1/2(-1) = 1/2 + 1/2 = 1 bit. (H = 0, T = 1)
- Series of rolls of a 4-sided die: 1,2,4,1,3,2,4,3. Pr(1) = Pr(2) = Pr(3) = Pr(4) = 1/4. Each term is -(1/4) log2(1/4) = (1/4)(2) = 1/2, so the entropy is 1/2 + 1/2 + 1/2 + 1/2 = 2 bits. (00, 01, 10, 11)
Entropy Examples
- Pr("I am Groot") = 1. -1 log2(1) = -1(0) = 0 bits. (Although the entropy could be higher if we considered that Groot could speak or not.)
- Unfair 4-sided die: rolls 4 half the time, 3 a quarter of the time, and 1 and 2 an eighth of the time each. -1/2 log2(1/2) - 1/4 log2(1/4) - 1/8 log2(1/8) - 1/8 log2(1/8) = -1/2(-1) - 1/4(-2) - 1/8(-3) - 1/8(-3) = 1/2 + 1/2 + 3/8 + 3/8 = 1.75 bits.
- Notice that this is less than the fair die's 2 bits; this die is more predictable.
Entropy Examples
- 256 equally likely characters: Σ -1/256 log2(1/256) = log2 256 = 8 bits. If we really have no way of predicting what's coming next, we need all 8 bits every time.
- 256 characters where lowercase letters and digits (36 characters) are used with equal probability 60% of the time; capital letters and 10 special characters (36) are used with equal probability 30% of the time; and the remaining characters split the remaining 10%:
Pr(any particular lowercase or digit) = 0.6 * 1/36 = 0.0167
Pr(any particular uppercase or special char) = 0.3 * 1/36 = 0.0083
Pr(other) = 0.1 * 1/(256-72) = 0.1/184 = 0.0005
36(-0.0167 log2 0.0167) + 36(-0.0083 log2 0.0083) + 184(-0.0005 log2 0.0005) = 6.69 bits under optimal encoding. Not that big a deal, really.
Quick Check: Entropy
- Calculate the entropy for the following sequence of characters (assuming the observed frequencies reflect the underlying probabilities): ABCA
- To get you started, notice that Pr(A) = 2/4 = 1/2 and that its term is -(1/2) log2(1/2) = (1/2)(1) = 1/2.
Quick Check: Entropy
- Calculate the entropy for the following sequence of characters: ABCA
- Pr(A) = 1/2; -(1/2) log2(1/2) = (1/2)(1) = 1/2
- Pr(B) = 1/4; -(1/4) log2(1/4) = (1/4)(2) = 1/2
- Pr(C) is identical to Pr(B), so the entropy is 1/2 + 1/2 + 1/2 = 1.5 bits.
- Sanity check: this makes sense because the code 0, 10, 11 uses 1 bit half the time and 2 bits half the time.
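The quick check can be automated by estimating probabilities from observed counts. A small sketch (the helper name `empirical_entropy` is made up for illustration):

```python
import math
from collections import Counter

# Estimate entropy from observed character frequencies, assuming the
# observed frequencies reflect the underlying probabilities.
def empirical_entropy(s):
    n = len(s)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(s).values())

print(empirical_entropy("ABCA"))  # 1.5 bits, matching the quick check
```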
Physical Entropy
- Informational entropy, defined in the 1940s by Claude Shannon, gets its name from physical entropy, defined in the 19th century in physics (specifically thermodynamics).
- In physics, entropy refers to how unpredictable a physical system is; the equation is -k_B Σ_X Pr(X) ln Pr(X), where the X are physical events and k_B is a physical constant we don't need to discuss. You can see the resemblance to Shannon's concept.
- In physics, entropy is associated with heat, since higher temperature leads to higher unpredictability of the particles.
- If describing physical events, these quantities differ only by a constant factor.
Information Gain
- Just as learning new information can change our perception of a probability, it can also reduce entropy.
- For example, we could learn that the next symbol is definitely not an A. This would reduce our surprise when we get a B or a C.
- When we learn information that rules out particular possibilities or changes the probability distribution, we can calculate a new entropy. The difference between the entropies is called the information gain.
Information Gain Example
- We're on an assembly line with appliances coming down the pipe: 1/2 dishwashers, 1/4 dryers, 1/4 stoves.
- The current entropy is 1/2(1) + 1/4(2) + 1/4(2) = 1.5 bits. If we recorded the sequence, it would take 1.5 bits per appliance on average with 0 = dishwasher, 10 = dryer, 11 = stove.
- The message comes down the line: "No more dishwashers today!" New probabilities: 1/2 dryer, 1/2 stove.
- The new entropy is (1/2)(1) + (1/2)(1) = 1 bit. We could encode what happens now using just 0 = dryer, 1 = stove.
- The information gain of "no dishwashers" is 1.5 - 1 = 0.5 bits.
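The assembly-line numbers check out in a few lines. A sketch, reusing the standard entropy formula:

```python
import math

def entropy(probs):
    # H = sum of -p * log2(p) over the distribution
    return sum(-p * math.log2(p) for p in probs if p > 0)

# Entropy before and after the "no more dishwashers" announcement.
before = entropy([1/2, 1/4, 1/4])  # dishwasher, dryer, stove: 1.5 bits
after = entropy([1/2, 1/2])        # dryer, stove: 1.0 bit
print(before - after)              # information gain: 0.5 bits
```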
Information Gain on an Image
- It's common for uncompressed images to represent pixels as triples (R,G,B), where each value is 0-255.
- If we treat all colors as equally likely, the entropy of the pixel values is -log2(1/(256*256*256)) = -log2(1/2^24) = 24 bits, precisely the bits needed to send each pixel. (We skipped summing over 2^24 values and dividing by 2^24; these operations cancel out.)
- Suppose we learn that the image is grayscale, meaning all pixels are (x, x, x) for some x in [0,255]. The new entropy is -log2(1/256) = 8 bits (again the true bits per pixel).
- The information gain is thus 24 - 8 = 16 bits, a measure of the savings per pixel.
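The same arithmetic, written out as a quick sketch:

```python
import math

# All 2**24 RGB colors equally likely vs. 256 grayscale levels.
rgb_bits = -math.log2(1 / 2**24)  # 24.0 bits per pixel
gray_bits = -math.log2(1 / 256)   # 8.0 bits per pixel
print(rgb_bits - gray_bits)       # information gain: 16.0 bits per pixel
```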
Information Gain in Machine Learning
- Some classifiers try to focus on features that have information gain relative to the training example classifications.
- Example: if classifying images as octopus or not, a "yes" is less surprising once you know the animal has 8 arms.
- Before (base rate): Octopus 0.25, Not Octopus 0.75.
- 8 arms: Octopus 1, Not Octopus 0. Fewer than 8 arms: Octopus 0.05, Not Octopus 0.95.
- Both piles have lower entropy.
Information Gain in Machine Learning
- Worse features will tend to do little or nothing to reduce the entropy, and these can be ignored in the classification.
- Before (base rate): Octopus 0.25, Not Octopus 0.75.
- Orange: Octopus 0.2, Not Octopus 0.8. Not orange: Octopus 0.3, Not Octopus 0.7.
- Little or no information gain.
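A feature's information gain compares the base-rate entropy to the weighted average entropy of the piles after the split. A sketch using the octopus example's numbers; note that the fraction of examples landing in the "8 arms" pile is not given on the slides, so the 20% figure below is an assumption for illustration:

```python
import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0)

base = entropy([0.25, 0.75])        # base-rate entropy, about 0.811 bits

p_split = 0.2                        # ASSUMED fraction with 8 arms
h_eight_arms = entropy([1.0, 0.0])   # pure pile: 0 bits
h_fewer_arms = entropy([0.05, 0.95])

# Information gain = base entropy minus the weighted post-split entropy.
gain = base - (p_split * h_eight_arms + (1 - p_split) * h_fewer_arms)
print(round(gain, 3))
```

Under this assumed split, the "8 arms" feature removes most of the base-rate uncertainty, which is exactly why a decision-tree learner would pick it over the "orange" feature.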
Entropy and Steganography
- Steganography is the hiding of information in plain sight.
- One way cyberattacks can occur (or messages can be transmitted) is by embedding payload bits in the low-order bits of innocuous images, videos, or PDFs.
- Examining the least significant bits for high entropy (unpredictability) can reveal that something is not right; the payload must be compressed, creating abnormally high entropy.
https://securelist.com/steganography-in-contemporary-cyberattacks/79276/
Summary
- The information of a particular event can be calculated directly from its probability p: -log2 p. This is the lower bound on the number of bits necessary to encode the event. (2 equiprobable events => 1 bit each; 4 equiprobable events => 2 bits.)
- When p varies among symbols, we can take advantage of this to create variable-length codes (using Huffman coding) that use the optimal number of bits without becoming ambiguous.
- The entropy of a stream of symbols is the expected information: a measure of the average number of bits we need, and of how unpredictable or surprising the source is.