Maximum Entropy - Lecture 8

1 Introduction

In the previous discussions we have not been particularly concerned with the selection of the initial prior probability used in Bayes' theorem, since in most cases the result is insensitive to the initial values. This is not always the case, however, and we need a way of selecting initial probabilities from among many possible choices. We can use the principle of maximum entropy, which comes from information theory. Information entropy is a measure of the uncertainty in a probability distribution. A data stream is a transfer of information to which we can assign a set of possible probability distributions. The principle of maximum entropy states that the probabilities {p_i} should be assigned so as to maximize the entropy, subject to whatever constraints are known.

As an example, suppose we have 12 balls, all but one of equal weight. Using a balance as the only instrument, we wish to find the best strategy for identifying the odd ball. This is an optimization problem, and it is solved by choosing each weighing so that its possible outcomes are as nearly equiprobable as possible, i.e. so that each weighing extracts the maximum information.

2 Review of information theory

Following Shannon, the information carried by an outcome of probability p_i drawn from the set {p_i} is

I = \ln(1/p_i).

The entropy of the ensemble is

S = \sum_i p_i \ln(1/p_i).

As previously noted, the logarithm could be taken in any base, but in information theory it is almost always base 2, since information is coded in bits. Thus N random variables, each with entropy S, can be compressed without loss of information into NS bits or slightly more. Consider an ensemble of N independent, identically distributed random variables. The outcome x = (x_1, ..., x_N) almost certainly belongs to a typical subset A_x having about 2^{NS} members, each with probability close to 2^{-NS}. This is due to the equipartition of probabilities. Thus, as N \to \infty, the number of bits per variable required to specify an outcome approaches S, i.e.

S_\delta / N \to \text{constant as } N \to \infty,

where S_\delta measures the smallest subset of outcomes that carries essentially all of the probability (the most probable outcomes). This is shown in Figure 1.
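As a quick numerical illustration (added here, not part of the original notes; the biased-coin distribution is just an assumed example), the sketch below computes I = log(1/p) for a single outcome and the ensemble entropy in bits, and then applies the source-coding estimate that N outcomes require roughly NS bits.

```python
import math

def information(p, base=2.0):
    """Shannon information of an outcome with probability p: I = log(1/p)."""
    return math.log(1.0 / p, base)

def entropy(probs, base=2.0):
    """Ensemble entropy S = sum_i p_i log(1/p_i); terms with p_i = 0 contribute nothing."""
    return sum(p * math.log(1.0 / p, base) for p in probs if p > 0)

# Assumed example: a biased coin with p(heads) = 3/4, p(tails) = 1/4.
probs = [0.75, 0.25]
S = entropy(probs)
print(f"I(tails) = {information(0.25):.3f} bits, S = {S:.3f} bits per symbol")

# Source-coding estimate: N such variables compress into roughly N*S bits,
# and a typical sequence has probability close to 2**(-N*S).
N = 1000
print(f"{N} symbols -> about {N * S:.0f} bits; typical-sequence probability ~ 2^(-{N * S:.0f})")
```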
Figure 1: A plot of S_\delta/N (vertical axis) vs \delta (horizontal axis) for various N

3 Application of information theory

Consider a gas of N point particles, initially all contained in a volume V, and suppose we know the position of each particle. Divide the volume into 10^6 small boxes of volume \delta v, and to each of the 10^6 boxes assign a probability p_i of finding a gas particle within it. If box i holds n_i particles, then p_i = n_i/N; for an even distribution, n_i = N/10^6. The entropy is

S = \sum_{i=1}^{10^6} p_i \ln(1/p_i) = \sum_{i=1}^{10^6} (n_i/N) \ln(N/n_i).

In the case of an even distribution,

S = \sum_{i=1}^{10^6} \frac{N/10^6}{N} \ln\!\left(\frac{N}{N/10^6}\right) = \ln(10^6).

Note that the entropy depends on the measurement scale. If the box size is decreased by a factor of 10^3 (i.e. the accuracy of the measured particle positions increases), the entropy increases to

S = \ln(10^9).

For macroscopic systems we most often use relative rather than absolute measures of entropy, because of the large number of particles.

Now allow the particles to move. The state of the system is determined by the distribution of particles among the boxes, and we assume that in any given time step a particle is equally likely to move into any box. There are 10^6 configurations with all of the particles in a single box.
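The dependence of the positional entropy on the measurement scale is easy to check numerically. The sketch below is an added illustration, not part of the notes; the particle number and box counts are assumed, and are scaled down from the 10^6 boxes of the text so that the sum runs quickly. Refining the boxes by a factor of 10^3 adds ln(10^3) to the entropy.

```python
import math

def positional_entropy(counts):
    """S = sum_i (n_i/N) ln(N/n_i) for particle counts n_i over the boxes."""
    N = sum(counts)
    return sum((n / N) * math.log(N / n) for n in counts if n > 0)

# Small illustrative case: the same N particles spread evenly over M boxes,
# then over 1000*M finer boxes.
N = 6_000_000                          # assumed particle number, divisible by both box counts
coarse = [N // 1000] * 1000            # stand-in for the coarse division into boxes
fine   = [N // 1_000_000] * 1_000_000  # boxes refined by a factor of 10^3
S1, S2 = positional_entropy(coarse), positional_entropy(fine)
print(f"coarse: S = {S1:.3f} (= ln(1000) = {math.log(1000):.3f})")
print(f"fine:   S = {S2:.3f}; increase = {S2 - S1:.3f} = ln(10^3)")
```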
In this case the relative entropy is

S(\text{all in 1 box}) = \sum_{i=1}^{10^6} p_i \ln(1/p_i) = 0,

because the occupied box has p_i = n_i/N = 1, so its term \ln(1/p_i) vanishes, and p_i \ln(1/p_i) = 0 for every empty box with p_i = 0.

Now suppose the particles are put into 2 boxes. The number of ways to choose the occupied pair of boxes is

\binom{10^6}{2} = \frac{10^6!}{2!\,(10^6 - 2)!} \approx 5 \times 10^{11}.

With half of the particles in each box, the entropy is

S(\text{all in 2 boxes}) = \frac{1}{2}\ln(2) + \frac{1}{2}\ln(2) = \ln(2).

There are roughly 5 \times 10^{11} two-box configurations compared with 10^6 one-box configurations, so the probability of finding the system in a two-box configuration is

P = \frac{5 \times 10^{11}}{5 \times 10^{11} + 10^6} \approx 1 - 2 \times 10^{-6}.

Now suppose the particles distribute themselves almost evenly, with half of the boxes holding one particle fewer and half holding one particle more than the average. The number of such configurations is

\binom{10^6}{10^6/2} \approx 10^{3 \times 10^5},

and each of these configurations has entropy close to \ln(10^6). The result of all this is that we started the system with entropy 0, and the probability that it moves to a higher-entropy configuration is

P = 1 - 10^{-3 \times 10^5} \approx 1.

It is overwhelmingly probable that as time passes a macroscopic system will increase its entropy until it reaches a maximum. This provides an explanation (if not a proof) of the 2nd law of thermodynamics, and it is the foundation of the maximum entropy principle.
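To see just how lopsided this counting is, the following sketch (an added illustration, not from the notes) compares the number of one-box, two-box, and nearly even configurations of the 10^6 boxes. It works with logarithms of factorials (math.lgamma) so that the nearly even count, which is of order 10^{3x10^5}, never has to be formed as an integer.

```python
import math

def log10_comb(n, k):
    """log10 of the binomial coefficient C(n, k), via lgamma to avoid huge integers."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)) / math.log(10)

M = 10**6                                   # number of boxes

one_box = M                                 # configurations with everything in one box
two_box = math.comb(M, 2)                   # ~5e11 ways to pick the occupied pair of boxes
print(f"two-box configurations ~ {two_box:.2e}")
print(f"P(two-box rather than one-box) ~ {two_box / (two_box + one_box):.8f}")

# Nearly even occupation: choose which half of the boxes holds the extra particle.
print(f"log10 of nearly even configurations ~ {log10_comb(M, M // 2):.0f}")  # ~3e5
```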
4 Example - Texas armadillos and Dos Equis

We cannot explore the application of maximum entropy in detail, but we can give an illustrative example. We look at Texas armadillos, classified by whether they are left-handed and whether they drink Dos Equis beer (Figure 2). [paraphrased from Jaynes - Maximum Entropy and Bayesian Methods] Suppose observation establishes that 3/4 of the armadillos in Texas are left-handed and 3/4 drink Dos Equis. We fill in the data in Table 1.

Figure 2: A Dos Equis drinking armadillo

Table 1: Probability table for Texas armadillos

  Beer          Left Handed   Right Handed   Probability
  Dos Equis        p_11           p_12           3/4
  Other            p_21           p_22           1/4
  Total             3/4            1/4             1

Armadillos come in quantized units, so for a total of N armadillos the probabilities are

p_{ij} = N_{ij}/N,

which leads to the equations

p_{11} + p_{12} = 3/4
p_{21} + p_{22} = 1/4
p_{11} + p_{21} = 3/4
p_{12} + p_{22} = 1/4.

Solving these equations with p_{22} = q, the table takes the form

\begin{pmatrix} 0.5 + q & 0.25 - q \\ 0.25 - q & q \end{pmatrix}.

We find that p_{12} = p_{21}. It can also be shown that, when assigning probabilities, maximizing the entropy is the only choice that does not introduce correlations between the variables. The number of armadillos in Texas is large, so there are many possible assignments consistent with the constraints. We count them using the multinomial coefficient

W = \frac{N!}{N_{11}!\, N_{12}!\, N_{21}!\, N_{22}!}.
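One can check numerically that, among all tables consistent with the marginal constraints, the entropy is maximized by the uncorrelated assignment p_{ij} = (row total)(column total). The sketch below is an added illustration, not part of the notes; it scans the free parameter q = p_{22} over its allowed range and reports the entropy-maximizing value, which comes out at (1/4)(1/4) = 1/16.

```python
import math

def table(q):
    """The 2x2 probability table consistent with the 3/4 - 1/4 marginals, parametrized by q = p22."""
    return [0.5 + q, 0.25 - q, 0.25 - q, q]   # p11, p12, p21, p22

def entropy(probs):
    """S = sum_i p_i ln(1/p_i), with the p_i = 0 terms taken as 0."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0)

# Scan q over its allowed range 0 <= q <= 1/4 and find the maximum-entropy table.
qs = [k * 0.25 / 10000 for k in range(10001)]
q_best = max(qs, key=lambda q: entropy(table(q)))
print(f"entropy-maximizing q ~ {q_best:.4f} (independent assignment gives 1/4 * 1/4 = 0.0625)")
print("table:", [round(p, 4) for p in table(q_best)])
```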
As an example, choose N = 4. There are 2 possible solutions (the entropy values below use base-10 logarithms).

Solution 1: q = 0,

\frac{1}{4}\begin{pmatrix} 2 & 1 \\ 1 & 0 \end{pmatrix}.

This has multiplicity W = 12 and entropy S = \sum_{ij} p_{ij} \log_{10}(1/p_{ij}) = 0.45.

Solution 2: q = 1/N = 1/4,

\frac{1}{4}\begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix}.

This has multiplicity W = 4 and entropy S = 0.24.

The solution with the greatest entropy occurs 75% of the time (12/(12 + 4)).

Now suppose we change the analysis by introducing a connection between drinking Dos Equis and handedness - a gene, perhaps. This introduces a constraint and a correlation. The solution can be developed as above, and the results would then be iterated in a Bayesian approach with the above prior.
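The N = 4 counting can be verified by brute force. This sketch (added here, not in the original notes) enumerates all integer tables N_{ij} that satisfy the marginal constraints for N = 4, computes each table's multiplicity W and entropy, and reproduces the 12-to-4 split, i.e. the maximum-entropy table accounts for 12/16 = 75% of the arrangements.

```python
import math
from itertools import product

N = 4
solutions = []
for n11, n12, n21, n22 in product(range(N + 1), repeat=4):
    # Marginal constraints: 3/4 drink Dos Equis, 3/4 are left-handed.
    if (n11 + n12 == 3 and n21 + n22 == 1 and
            n11 + n21 == 3 and n12 + n22 == 1):
        W = math.factorial(N) // (math.factorial(n11) * math.factorial(n12)
                                  * math.factorial(n21) * math.factorial(n22))
        probs = [n / N for n in (n11, n12, n21, n22)]
        S = sum(p * math.log10(1.0 / p) for p in probs if p > 0)   # base-10, as in the text
        solutions.append(((n11, n12, n21, n22), W, S))

total = sum(W for _, W, _ in solutions)
for counts, W, S in solutions:
    print(f"counts {counts}: W = {W}, S = {S:.2f}, fraction = {W / total:.0%}")
```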