What is Entropy?

Jeff Gill, jgill@ucdavis.edu

1 Entropy in Information Theory

There are many definitions of information in various literatures, but all of them share the property of being distinct from a message. If a message is the physical manifestation of information transmission (language, character set, salutations, headers/trailers, body, etc.), then the information is the substantive content, independent of the transmission process. For example, in baseball the third-base coach transmits a message to the runner on second base through hand gestures. The information contained in the hand gestures is independent of the actual gestures made to the base-runner. If you could condense a set of messages down to a minimal finite length, then the information content would be the number of digits required by these alternate sets of messages, and in this case information and message would be identical. In practice, however, information and message are never exactly the same. For example, the DNA strand required to uniquely determine the biological characteristics of a human being contains far more codes than is minimally sufficient to transmit the required genetic information.

In information theory, entropy is a measure of the average amount of information required to describe the distribution of some random variable of interest (Cover and Thomas 1991). More generally, information can be thought of as a measure of the reduction in uncertainty given a specific message (Shannon 1948; Ayres 1994, p. 44). Suppose we wanted to identify a particular voter through serial information on this person's characteristics. We are allowed to ask a consecutive set of yes/no questions (as in the common guessing game). As we get answers to our series of questions, we gradually converge (hopefully, depending on our skill) on the desired voter. Our first question is: does the voter reside in California? Since about 13% of voters in the United States reside in California, a yes answer gives us different information than a no answer. Restated, a yes answer reduces our uncertainty more than a no answer, because a yes answer eliminates 87% of the choices whereas a no answer eliminates only 13%. If $P_i$ is the probability of the $i$th event (residing in California), then the improvement in information is defined as:

\[
  I_{P_i} = \log_2\left[\frac{1}{P_i}\right] = -\log_2 P_i. \tag{1}
\]

The probability is placed in the denominator of (1) because the smaller the probability, the greater the investigative information supplied by a yes answer. The log function is required to obtain some desired properties (discussed below), and is justified by limit theorems (Bevensee 1993, Jaynes 1982, Van Campenhout and Cover 1981).
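As a rough illustration of (1), the short Python sketch below (the function name is ours, not from the text) computes the information supplied by the yes and no answers to the California question, using the 13%/87% split cited above.

```python
import math

def surprisal_bits(p):
    """Information in bits supplied by an answer whose probability is p, per equation (1)."""
    return -math.log2(p)

# The California question: roughly 13% of U.S. voters reside in California.
print(surprisal_bits(0.13))  # ~2.94 bits: a yes answer eliminates 87% of the choices
print(surprisal_bits(0.87))  # ~0.20 bits: a no answer eliminates only 13%
```

The less probable yes answer is worth far more information than the no answer, exactly as argued above.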

The log is base 2 since there are only two possible answers to our question (yes and no), making the units of information bits. In this example $I_{P_i} = -\log_2(0.13) = 2.943416$ bits, whereas if we had asked whether the voter lives in Arkansas, an affirmative reply would have increased our information by $-\log_2(0.02) = 5.643856$ bits, or about twice as much. However, there is a much smaller probability that we would have gotten an affirmative reply had the question been asked about Arkansas. What Slater (1938) found, and Shannon (1948) later refined, was the idea that the value of the question is the information returned by a positive response times the probability of a positive response. So the value of the $i$th binary-response question is just:

\[
  H_i = f_i \log_2\left[\frac{1}{f_i}\right] = -f_i \log_2 f_i, \tag{2}
\]

and the value of a series of $n$ of these questions is:

\[
  H = k \sum_i f_i \log_2\left[\frac{1}{f_i}\right] = -k \sum_i f_i \log_2 f_i, \tag{3}
\]

where $f_i$ is the frequency distribution of the $i$th yes answer and $k$ is an arbitrary scaling factor that determines the choice of units. The arbitrary scaling factor makes the choice of base in the logarithm unimportant, since we can change the base by manipulating the constant.[1] The function (3) was actually first introduced by Slater (1938), although similar forms were referenced by Boltzmann (1877), where $f_i$ is the probability that a particle system is in microstate $i$.

We can see that the total improvement in information is the additive value of the series of individual information improvements. So in our simple example we might ask a series of questions narrowing down on the individual of interest: Is the voter in California? Is the voter registered as a Democrat? Does the voter reside in an urban area? Is the voter female? The total information supplied by this vector of yes/no responses is the total information improvement, in units of bits since the response space is binary. It is important to remember that the information obtained is defined only with regard to a well-defined question having finite, enumerated responses.

The link between information and entropy is often confusing due to differing terminology and usage in various literatures. In the thermodynamic sense, information and entropy are complementary: gain in entropy means loss in information, nothing more (Lewis 1930). So as uncertainty about the system increases, information about the configuration of the microstate of a system decreases.

[1] For instance, if the entropy form were expressed in terms of the natural log, but $\log_2$ was more appropriate for the application (as above), then setting $k = 1/\ln 2$ converts the entropy form to base 2.
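A minimal sketch of (2) and (3) follows, with $k = 1$ so the units are bits. The frequency 0.13 for the California question is from the text; the remaining frequencies in the series are hypothetical values used only to show the arithmetic.

```python
import math

def question_value_bits(f):
    """Value of one binary-response question per equation (2): -f * log2(f)."""
    return -f * math.log2(f)

def series_value_bits(freqs, k=1.0):
    """Value of a series of questions per equation (3): -k * sum_i f_i * log2(f_i)."""
    return -k * sum(f * math.log2(f) for f in freqs)

# California question alone: information from a yes answer weighted by its probability.
print(question_value_bits(0.13))                    # ~0.38 bits

# A hypothetical series (California, Democrat, urban, female); only 0.13 is from the text.
print(series_value_bits([0.13, 0.45, 0.80, 0.52]))  # total value of the series, in bits
```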

In terms of information theory, the distinction was less clear until Shannon (1948) presented the first unambiguous picture. To Shannon, the function (3) represented the expected uncertainty. In a given communication, possible messages are indexed $m_1, m_2, \ldots, m_k$, each with an associated probability $p_1, p_2, \ldots, p_k$. These probabilities sum to one, so the scaling factor is simply 1. Thus the Shannon entropy function is the negative expected value of the natural log of the probability mass function:

\[
  H = -\sum_{i=1}^{k} p_i \ln(p_i). \tag{4}
\]

Shannon defined the information in a message as the difference between the entropy before the message and the entropy after the message. If there is no information before the message, then a uniform prior distribution[2] is assigned to the $p_i$ and entropy is at its maximum.[3] In this case any result increases our information. Yet if there is certainty about the result, then a degenerate distribution describes the $m_i$, and the message does not change our information level.[4] So according to Shannon, the message produces a new assessment about the state of the world, and this in turn leads to the assignment of new probabilities ($p_i$) and a subsequent updating of the entropy value.

The simplicity of the Shannon entropy function belies its theoretical and practical importance. Shannon (1948, Appendix 2) showed that (4) is the only function that satisfies the following three desirable properties:

1. $H$ is continuous in $(p_1, p_2, \ldots, p_n)$.

2. If the $p_i$ are uniformly distributed, then $H$ is at its maximum and is monotonically increasing in $n$.

3. If a set of alternatives can be reformulated as multiple, consecutive sets of alternatives, then the first $H$ should equal the weighted sum of the consecutive $H$ values: $H(p_1, p_2, p_3) = H(p_1, 1-p_1) + (1-p_1)\,H\!\left(\frac{p_2}{1-p_1}, \frac{p_3}{1-p_1}\right)$.

These properties mean that the Shannon entropy function is well behaved with regard to relative information comparisons.

[2] The uniform prior distribution as applied provides the greatest entropy, since no single event is more likely to occur than another. Thus the uniform distribution of events provides the minimum information possible with which to decode the message. This application of the uniform distribution does not imply that it is a no-information assumption, since equally likely outcomes is certainly a type of information. A great deal of controversy and discussion has focused on the erroneous treatment of the uniform distribution as a zero-based information source (Dale 1991, Stigler 1986).

[3] $H = -\sum_{i=1}^{n} \frac{1}{n} \ln\left(\frac{1}{n}\right) = \ln(n)$, so entropy increases logarithmically with the number of equally likely alternatives.

[4] $H = -\sum_{i=1}^{n} p_i \ln(p_i) = -(1)\ln(1) = 0$, taking $0 \ln 0 = 0$.
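The sketch below, which assumes nothing beyond the Python standard library, evaluates (4) and numerically checks footnotes 3 and 4 as well as the grouping property in item 3; the particular probabilities are illustrative only.

```python
import math

def shannon_entropy(p):
    """Shannon entropy per equation (4): -sum_i p_i ln(p_i), with 0*ln(0) taken as 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Footnote 3: a uniform distribution over n outcomes gives the maximum, ln(n).
n = 4
print(shannon_entropy([1 / n] * n), math.log(n))   # both ~1.3863

# Footnote 4: a degenerate distribution carries zero entropy.
print(shannon_entropy([1.0, 0.0, 0.0, 0.0]))       # 0.0

# Property 3 (grouping), with the second-stage probabilities renormalized.
p1, p2, p3 = 0.5, 0.3, 0.2
lhs = shannon_entropy([p1, p2, p3])
rhs = shannon_entropy([p1, 1 - p1]) + (1 - p1) * shannon_entropy(
    [p2 / (1 - p1), p3 / (1 - p1)])
print(abs(lhs - rhs) < 1e-12)                      # True
```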

Up until this point, only the discrete form of the entropy formulation has been discussed. Since entropy is also a measure of uncertainty in probability distributions, it is important to have a formulation that applies to continuous probability density functions. The following entropy formula also satisfies all of the properties enumerated above:

\[
  H = -\int_{-\infty}^{+\infty} f(x) \ln(f(x))\, dx. \tag{5}
\]

This form is often referred to as the differential entropy of a random variable $X$ with a known probability density function $f(x)$. There is one important distinction between (5) and (4): the discrete case measures the absolute uncertainty of a random variable given a set of probabilities, whereas in the continuous case this uncertainty is measured relative to the chosen coordinate system (Ryu 1993). So transformations of the coordinate system require a transformation of the probability density function with the associated Jacobian. In addition, the continuous case can have infinite entropy unless additional side conditions are imposed (see Jaynes 1968).

There has been considerable debate about the nature of information with regard to entropy (Ayres 1994; Jaynes 1957; Ruelle 1991; Tribus 1961, 1979). A major question arises: whose entropy is it? The sender's? The receiver's? The transmission equipment's? These are often philosophical debates rather than practical considerations, as Shannon's focus was clearly on the receiver as the unit of entropy measurement. Jaynes (1982), however, considers the communication channel and its limitations as the determinants of entropy. Some, like Aczél and Daróczy (1975), view entropy as a descriptor of a stochastic event, while the traditional thermodynamic definition of entropy is as a measure of uncertainty in microstates. For the purposes of this work, the Shannon formulation (4) is used.
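As a sketch of (5), the code below approximates the differential entropy of a standard normal density with a simple midpoint Riemann sum and compares it with the known closed form $\tfrac{1}{2}\ln(2\pi e)$; the integration limits and step count are arbitrary choices for this example.

```python
import math

def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def differential_entropy(pdf, lo=-10.0, hi=10.0, steps=100_000):
    """Approximate equation (5), the negative integral of f(x) ln f(x) dx, by a midpoint sum."""
    dx = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        fx = pdf(lo + (i + 0.5) * dx)
        if fx > 0.0:
            total -= fx * math.log(fx) * dx
    return total

print(differential_entropy(normal_pdf))        # ~1.4189 nats
print(0.5 * math.log(2 * math.pi * math.e))    # closed form for sigma = 1: 1.4189...
```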

2 References

Aczél, J. and Z. Daróczy. 1975. On Measures of Information and Their Characterizations. New York: Academic Press.

Ayres, David. 1994. Information, Entropy, and Progress. New York: American Institute of Physics Press.

Bevensee, Robert M. 1993. Maximum Entropy Solutions to Scientific Problems. Englewood Cliffs, NJ: Prentice Hall.

Boltzmann, L. 1877. Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung, respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte 76: 373-95.

Cover, Thomas M. and Joy A. Thomas. 1991. Elements of Information Theory. New York: John Wiley and Sons.

Dale, Andrew I. 1991. A History of Inverse Probability: From Thomas Bayes to Karl Pearson. New York: Springer-Verlag.

Jaynes, Edwin T. 1957. Information Theory and Statistical Mechanics. Physical Review 106: 620-30.

Jaynes, Edwin T. 1968. Prior Probabilities. IEEE Transactions on Systems Science and Cybernetics SSC-4: 227-41.

Jaynes, Edwin T. 1982. On the Rationale of Maximum-Entropy Methods. Proceedings of the IEEE 70(9): 939-52.

Lewis, Gilbert N. 1930. The Symmetry of Time in Physics. Science LXXI: 569-77.

Robert, Claudine. 1990. An Entropy Concentration Theorem: Applications in Artificial Intelligence and Descriptive Statistics. Journal of Applied Probability 27: 303-13.

Ruelle, David. 1991. Chance and Chaos. Princeton: Princeton University Press.

Ryu, Hang K. 1993. Maximum Entropy Estimation of Density and Regression Functions. Journal of Econometrics 56: 397-440.

Shannon, Claude. 1948. A Mathematical Theory of Communication. Bell System Technical Journal 27: 379-423, 623-56.

Stigler, Stephen M. 1986. The History of Statistics: The Measurement of Uncertainty before 1900. Cambridge, MA: Harvard University Press.

Tribus, Myron. 1961. Information Theory as the Basis for Thermostatics and Thermodynamics. Journal of Applied Mechanics 28: 1-8.

Tribus, Myron. 1979. Thirty Years of Information Theory. In The Maximum Entropy Formalism, eds. Raphael D. Levine and Myron Tribus. Cambridge: MIT Press.

Van Campenhout, Jan M. and Thomas M. Cover. 1981. Maximum Entropy and Conditional Probability. IEEE Transactions on Information Theory 27(4): 483-9.