Principle of Maximum Entropy


Chapter 9
Principle of Maximum Entropy

Section 8.2 presented the technique of estimating input probabilities of a process that are unbiased but consistent with known constraints expressed in terms of averages, or expected values, of one or more quantities. This technique, the Principle of Maximum Entropy, was developed there for the simple case of one constraint and three input events, in which case the technique can be carried out analytically. It is described here for the more general case.

9.1 Problem Setup

Before the Principle of Maximum Entropy can be used, the problem domain needs to be set up. In cases involving physical systems, this means that the various states in which the system can exist need to be identified, and all the parameters involved in the constraints known. For example, the energy, electric charge, and other quantities associated with each of the quantum states are assumed known. It is not assumed in this step which particular state the system is actually in (which state is "occupied"); indeed it is assumed that we cannot ever know this with certainty, and so we deal instead with the probability of each of the states being occupied. In applications to nonphysical systems, the various possible events have to be enumerated and the properties of each determined, particularly the values associated with each of the constraints.

In this chapter we will apply the general mathematical derivation to two examples, one a business model, and the other a model of a physical system (both very simple and crude).

9.1.1 Berger's Burgers

This example was used in Chapter 8 to deal with inference and the analytic form of the Principle of Maximum Entropy. A fast-food restaurant offers three meals: burger, chicken, and fish. Now we suppose that the menu has been extended to include a gourmet low-fat tofu meal. The price, Calorie count, and probability of each meal being delivered cold are listed in Table 9.1.

Item     Entree    Cost    Calories   Probability of   Probability of
                                      arriving hot     arriving cold
Meal 1   Burger    $1.00   1000       0.5              0.5
Meal 2   Chicken   $2.00    600       0.8              0.2
Meal 3   Fish      $3.00    400       0.9              0.1
Meal 4   Tofu      $8.00    200       0.6              0.4

Table 9.1: Berger's Burgers
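The examples in this chapter are small enough to explore numerically. As a starting point, here is a minimal Python sketch of the problem setup, encoding each state (each meal) together with the parameter values that enter the constraints. The names are illustrative only, not part of any course software.

    # Minimal sketch of the problem domain: each state carries the
    # parameter values that will appear in the constraints (Table 9.1).
    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Meal:
        name: str
        price: float      # dollars
        calories: int     # Calories
        p_cold: float     # probability of arriving cold

    MENU = [
        Meal("Burger",  1.00, 1000, 0.5),
        Meal("Chicken", 2.00,  600, 0.2),
        Meal("Fish",    3.00,  400, 0.1),
        Meal("Tofu",    8.00,  200, 0.4),
    ]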

9.1.2 Magnetic Dipole Model

An array of magnetic dipoles (think of them as tiny magnets) is subjected to an externally applied magnetic field H, and therefore the energy of the system depends on their orientations and on the applied field. For simplicity our system contains only one such dipole, which from time to time is able to interchange information and energy with either of two environments, which are much larger collections of dipoles. Each dipole, both in the system and in its two environments, can be either "up" or "down". The system has one dipole, so it has only two states, corresponding to the two states for that dipole, up and down (if the system had n dipoles it would have 2^n states).

[Figure 9.1: Dipole moment example. The system sits between a left environment and a right environment, with the field H applied to the system. (Each dipole can be either up or down.)]

The energy of each dipole is proportional to the applied field and depends on its orientation; the energy of the system is the sum of the energies of all the dipoles in the system, in our case only one such.

State   Alignment   Energy
U       up          -m_d H
D       down         m_d H

Table 9.2: Magnetic Dipole Moments

The constant m_d is expressed in Joules per Tesla, and its value depends on the physics of the particular dipole. For example, the dipoles might be electron spins, in which case m_d = 2 μ_B μ_0, where μ_0 = 4π × 10^-7 henries per meter (in rationalized MKS units) is the permeability of free space, μ_B = eħ/2m_e = 9.272 × 10^-24 Joules per Tesla is the Bohr magneton, and where ħ = h/2π, h = 6.626 × 10^-34 Joule-seconds is Planck's constant, e = 1.602 × 10^-19 coulombs is the magnitude of the charge of an electron, and m_e = 9.109 × 10^-31 kilograms is the rest mass of an electron.

In Figure 9.1, the system is shown between two environments, and there are barriers between the environments and the system (represented by vertical lines) which prevent interaction (later we will remove the barriers to permit interaction). The dipoles, in both the system and the environments, are represented by a symbol that may be either spin-up or spin-down. The magnetic field shown is applied to the system only, not to the environments.

The virtue of a model with only one dipole is that it is simple enough that the calculations can be carried out easily. Such a model is, of course, hopelessly simplistic and cannot be expected to lead to numerically accurate results. A more realistic model would require so many dipoles and so many states that practical computations on the collection could never be done. For example, a mole of a chemical element is a small amount by everyday standards, but it contains Avogadro's number N_A = 6.02252 × 10^23 of atoms, and a correspondingly large number of electron spins; the number of possible states would be 2 raised to that power. Just how large this number is can be appreciated by noting that the earth contains no more than 2^170 atoms, and the visible universe has about 2^265 atoms; both of these numbers are far less than the number of states in that model.
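As a quick arithmetic check, the constants quoted above reproduce the stated value of the Bohr magneton. Here is a short Python sketch using only the values given in the text:

    # Numeric check of the quoted constants for an electron-spin dipole.
    import math

    e    = 1.602e-19           # electron charge magnitude, coulombs
    m_e  = 9.109e-31           # electron rest mass, kilograms
    h    = 6.626e-34           # Planck's constant, joule-seconds
    hbar = h / (2 * math.pi)
    mu_0 = 4 * math.pi * 1e-7  # permeability of free space, henries/meter

    mu_B = e * hbar / (2 * m_e)    # Bohr magneton; expect ~9.27e-24 J/T
    m_d  = 2 * mu_B * mu_0         # the dipole constant as defined above
    print(f"mu_B = {mu_B:.4g}, m_d = {m_d:.4g}")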

Even if we are less ambitious and want to compute with a much smaller sample, say 200 spins, and want to represent in our computer the probability of each state (using only 8 bits per state), we would still need more bytes of memory than there are atoms in the earth. Clearly it is impossible to compute with so many states, so the techniques described in these notes cannot be carried through in detail. Nevertheless there are certain conclusions and general relationships we will be able to establish.

9.2 Probabilities

Although the problem has been set up, we do not know which actual state the system is in. To express what we do know despite this ignorance, or uncertainty, we assume that each of the possible states A_i has some probability of occupancy p(A_i), where i is an index running over the possible states. A probability distribution p(A_i) has the property that each of the probabilities is between 0 and 1 (possibly being equal to either 0 or 1), and (since the input events are mutually exclusive and exhaustive) the sum of all the probabilities is 1:

    1 = Σ_i p(A_i)    (9.1)

As has been mentioned before, two observers may, because of their different knowledge, use different probability distributions. In other words, probability, and all quantities that are based on probabilities, are subjective, or observer-dependent. The derivations below can be carried out for any observer.

9.3 Entropy

Our uncertainty is expressed quantitatively by the information which we do not have about the state occupied. This information is

    S = Σ_i p(A_i) log₂(1/p(A_i))    (9.2)

Information is measured in bits, as a consequence of the use of logarithms to base 2 in Equation 9.2.

In dealing with real physical systems, with a huge number of states and therefore an entropy that is a very large number of bits, it is convenient to multiply the summation above by Boltzmann's constant k_B = 1.38 × 10^-23 Joules per Kelvin, and also to use natural logarithms rather than logarithms to base 2. Then S would be expressed in Joules per Kelvin:

    S = k_B Σ_i p(A_i) ln(1/p(A_i))    (9.3)

In the context of both physical systems and communication systems the uncertainty is known as the entropy. Note that because the entropy is expressed in terms of probabilities, it also depends on the observer, so two people with different knowledge of the system would calculate different numerical values for the entropy.
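Equations 9.2 and 9.3 translate directly into code. The sketch below defines both forms of the entropy; zero-probability terms are skipped, consistent with the convention that such terms contribute nothing in the limit.

    # Entropy in bits (Eq. 9.2) and in joules per kelvin (Eq. 9.3).
    import math

    k_B = 1.38e-23   # Boltzmann's constant, J/K

    def entropy_bits(p):
        return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

    def entropy_JK(p):
        return k_B * sum(pi * math.log(1 / pi) for pi in p if pi > 0)

    # Four equally likely states carry log2(4) = 2 bits of uncertainty.
    print(entropy_bits([0.25] * 4), entropy_JK([0.25] * 4))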

9.4 Constraints

The entropy has its maximum value when all probabilities are equal (we assume the number of possible states is finite), and the resulting value for entropy is the logarithm of the number of states, with a possible scale factor like k_B. If we have no additional information about the system, then such a result seems reasonable. However, if we have additional information in the form of constraints, then the assumption of equal probabilities would probably not be consistent with those constraints. Our objective is to find the probability distribution that has the greatest uncertainty, and hence is as unbiased as possible.

For simplicity we consider only one such constraint here. We assume that we know the expected value of some quantity (the Principle of Maximum Entropy can handle multiple constraints, but the mathematical procedures and formulas are more complicated). The quantity in question is one for which each of the states of the system has its own amount, and the expected value is found by averaging the values corresponding to each of the states, taking into account the probabilities of those states. Thus if there is a quantity G for which each of the states has a value g(A_i), then we want to consider only those probability distributions for which the expected value is a known value G̃:

    G̃ = Σ_i p(A_i) g(A_i)    (9.4)

Of course this constraint cannot be achieved if G̃ is less than the smallest g(A_i) or greater than the largest g(A_i).

9.4.1 Examples

For our Berger's Burgers example, suppose we are told that the average price of a meal is $2.50, and we want to estimate the separate probabilities of the various meals without making any other assumptions. Then our constraint would be

    $2.50 = $1.00 p(B) + $2.00 p(C) + $3.00 p(F) + $8.00 p(T)    (9.5)

For our magnetic-dipole example, assume the energies for states U and D are denoted e(i), where i is either U or D, and assume the expected value of the energy is known to be some value Ẽ. All these energies are expressed in Joules. Then

    Ẽ = e(U) p(U) + e(D) p(D)    (9.6)

The energies e(U) and e(D) depend on the externally applied magnetic field H. This parameter, which will be carried through the derivation, will end up playing an important role. If the formulas for e(i) from Table 9.2 are used here,

    Ẽ = m_d H [p(D) − p(U)]    (9.7)

9.5 Maximum Entropy, Analytic Form

The Principle of Maximum Entropy is based on the premise that when estimating the probability distribution, you should select that distribution which leaves you the largest remaining uncertainty (i.e., the maximum entropy) consistent with your constraints. That way you have not introduced any additional assumptions or biases into your calculations.

This principle was used in Chapter 8 for the simple case of three probabilities and one constraint, where the entropy could be maximized analytically. Using the constraint and the fact that the probabilities add up to 1, we expressed two of the unknown probabilities in terms of the third. Next, the possible range of values of the probabilities was determined using the fact that each of the three lies between 0 and 1. Then these expressions were substituted into the formula for entropy S so that it was expressed in terms of a single probability. Finally, any of several techniques could be used to find the value of that probability for which S is the largest.

This analytical technique does not extend to cases with more than three possible states and only one constraint. It is only practical because the constraint can be used to express the entropy in terms of a single variable. If there are, say, four unknowns and two equations, the entropy would be left as a function of two variables, rather than one. It would be necessary to search for its maximum in a plane. Perhaps this seems feasible, but what if there were five unknowns? (Or ten?) Searching in a space of three (or eight) dimensions would be necessary, and this is much more difficult. A different approach is developed in the next section, one well suited for a single constraint and many probabilities.
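To make the analytic approach concrete before moving on, here is a Python sketch for the original three-meal menu. The $1.75 average price is an illustrative value; the point is that normalization and the price constraint eliminate two probabilities, leaving the entropy as a function of a single variable that can be searched directly.

    # Analytic (Chapter 8 style) approach: three states, one constraint.
    # With prices $1, $2, $3 and an assumed average price of $1.75,
    # normalization and the constraint give
    #   p(B) = 0.25 + p(F),  p(C) = 0.75 - 2 p(F),  0 < p(F) < 0.375,
    # so the entropy depends on the single variable p(F).
    import math

    def S(p):
        return sum(pi * math.log2(1 / pi) for pi in p if pi > 0)

    grid = (k / 10000 for k in range(1, 3750))
    best_S, pF = max((S((0.25 + x, 0.75 - 2 * x, x)), x) for x in grid)
    print(f"p(B) = {0.25 + pF:.4f}, p(C) = {0.75 - 2 * pF:.4f}, "
          f"p(F) = {pF:.4f}, S = {best_S:.4f} bits")
    # Expect roughly p(B) = 0.466, p(C) = 0.318, p(F) = 0.216, S = 1.517 bits.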

9.6 Maximum Entropy, Single Constraint

Let us assume the average value of some quantity with values g(A_i) associated with the various events A_i is known; call it G̃ (this is the constraint). Thus there are two equations, one of which comes from the constraint and the other from the fact that the probabilities add up to 1:

    1 = Σ_i p(A_i)    (9.8)

    G̃ = Σ_i p(A_i) g(A_i)    (9.9)

where G̃ cannot be smaller than the smallest g(A_i) or larger than the largest g(A_i).

The entropy associated with this probability distribution is

    S = Σ_i p(A_i) log₂(1/p(A_i))    (9.10)

when expressed in bits. In the derivation below this formula for entropy will be used. It works well for examples with a small number of states. In later chapters of these notes we will start using the more common expression for entropy in physical systems, expressed in Joules per Kelvin,

    S = k_B Σ_i p(A_i) ln(1/p(A_i))    (9.11)

9.6.1 Dual Variable

Sometimes a problem is clarified by looking at a more general problem of which the original is a special case. In this case, rather than focusing on the specific value G̃, let us look at all possible values of G, which means the range between the smallest and largest values of g(A_i). Thus G becomes a variable rather than a known value (the known value will continue to be denoted G̃ here). Then rather than express things in terms of G as an independent variable, we will introduce a new dual variable, which we will call β, and express all the quantities of interest, including G, in terms of it. Then the original problem reduces to finding the value of β which corresponds to the known, desired value G̃, i.e., the value of β for which G(β) = G̃.

The new variable β is known as a Lagrange Multiplier, named after the French mathematician Joseph-Louis Lagrange (1736–1813).¹ Lagrange developed a general technique, using such variables, to perform constrained maximization, of which our current problem is a very simple case. We will not use the mathematical technique of Lagrange Multipliers; it is more powerful and more complicated than we need.

Here is what we will do instead. We will start with the answer, which others have derived using Lagrange Multipliers, and prove that it is correct. That is, we will give a formula for the probability distribution p(A_i) in terms of the dual variable β and the g(A_i) parameters, and then prove that the entropy calculated from this distribution, S(β), is at least as large as the entropy of any probability distribution that has the same expected value for G, namely G(β). Therefore the use of β automatically maximizes the entropy. Then we will show how to find the value of β, and therefore indirectly all the quantities of interest, for the particular value G̃ of interest (this will be possible because G(β) is a monotonic function of β, so calculating its inverse can be done with zero-finding techniques).

9.6.2 Probability Formula

The probability distribution p(A_i) we want has been derived by others. It is a function of the dual variable β:

    p(A_i) = 2^{−α} 2^{−β g(A_i)}    (9.12)

¹See a biography of Lagrange at http://www-groups.dcs.st-andrews.ac.uk/~history/Biographies/Lagrange.html

which implies

    log₂(1/p(A_i)) = α + β g(A_i)    (9.13)

where α is a convenient abbreviation² for this function of β:

    α = log₂ Σ_i 2^{−β g(A_i)}    (9.14)

Note that this formula for α guarantees that the p(A_i) from Equation 9.12 add up to 1, as required by Equation 9.8.

If β is known, the function α and the probabilities p(A_i) can be found and, if desired, the entropy S and the constraint variable G. In fact, if S is needed, it can be calculated directly, without evaluating the p(A_i); this is helpful if there are dozens or more probabilities to deal with. This short-cut is found by multiplying Equation 9.13 by p(A_i) and summing over i. The left-hand side is S, and the right-hand side simplifies because α and β are independent of i. The result is

    S = α + β G    (9.15)

where S, α, and G are all functions of β.

²The function α(β) is related to the partition function Z(β) of statistical physics: Z = 2^α, or α = log₂ Z.
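Each of these quantities involves only a sum over the states, so they are easy to compute once β is given. The following Python sketch (with a function name of our own choosing) evaluates α, the probabilities, G(β), and the entropy both directly and via the short-cut of Equation 9.15:

    # From beta and the g(A_i), compute alpha, p(A_i), G(beta), and S.
    import math

    def from_beta(beta, g):
        alpha = math.log2(sum(2 ** (-beta * gi) for gi in g))   # Eq. 9.14
        p = [2 ** (-alpha - beta * gi) for gi in g]             # Eq. 9.12
        G = sum(pi * gi for pi, gi in zip(p, g))                # Eq. 9.9
        S_direct = sum(pi * math.log2(1 / pi) for pi in p)      # Eq. 9.10
        S_shortcut = alpha + beta * G                           # Eq. 9.15
        return p, alpha, G, S_direct, S_shortcut

    p, alpha, G, S1, S2 = from_beta(0.25, [1.0, 2.0, 3.0, 8.0])
    assert abs(sum(p) - 1) < 1e-12    # Eq. 9.8 holds automatically
    assert abs(S1 - S2) < 1e-12       # the short-cut agrees with Eq. 9.10
    print(f"G(0.25) = {G:.4f}, S = {S1:.4f} bits")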

9.6.3 The Maximum Entropy

It is easy to show that the entropy calculated from this probability distribution is at least as large as that for any probability distribution which leads to the same expected value of G. Recall the Gibbs inequality, Equation 6.4, which will be rewritten here with p(A_i) and p′(A_i) interchanged (it is valid either way):

    Σ_i p′(A_i) log₂(1/p′(A_i)) ≤ Σ_i p′(A_i) log₂(1/p(A_i))    (9.16)

where p′(A_i) is any probability distribution and p(A_i) is any other probability distribution. The inequality is an equality if and only if the two probability distributions are the same.

The Gibbs inequality can be used to prove that the probability distribution of Equation 9.12 has the maximum entropy. Suppose there is another probability distribution p′(A_i) that leads to an expected value G′ and an entropy S′, i.e.,

    1 = Σ_i p′(A_i)    (9.17)

    G′ = Σ_i p′(A_i) g(A_i)    (9.18)

    S′ = Σ_i p′(A_i) log₂(1/p′(A_i))    (9.19)

Then it is easy to show that, for any value of β, if G′ = G(β) then S′ ≤ S(β):

    S′ = Σ_i p′(A_i) log₂(1/p′(A_i))
       ≤ Σ_i p′(A_i) log₂(1/p(A_i))
       = Σ_i p′(A_i) [α + β g(A_i)]
       = α + β G′
       = S(β) + β [G′ − G(β)]    (9.20)

where Equations 9.16, 9.13, 9.17, 9.18, and 9.15 were used. Thus the entropy associated with any alternative proposed probability distribution that leads to the same value for the constraint variable cannot exceed the entropy for the distribution that uses β.

9.6.4 Evaluating the Dual Variable

So far we have been considering the dual variable β to be an independent variable. If we start with a known value G̃, we want to use G̃ as an independent variable and calculate β in terms of it. In other words, we need to invert the function G(β), or find the β such that Equation 9.9 is satisfied. This task is not trivial; in fact most of the computational difficulty associated with the Principle of Maximum Entropy lies in this step. If there are a modest number of states and only one constraint in addition to the equation involving the sum of the probabilities, this step is not hard, as we will see. If there are more constraints this step becomes increasingly complicated, and if there are a large number of states the calculations cannot be done. In the case of more realistic models for physical systems, this summation is impossible to calculate, although the general relations among the quantities other than p(A_i) remain valid.

To find β, start with Equation 9.12 for p(A_i), multiply it by g(A_i) and by 2^α, and sum over i. The left-hand side becomes G(β) 2^α, because neither α nor G(β) depends on i. We already have an expression for α in terms of β (Equation 9.14), so the left-hand side becomes G(β) Σ_i 2^{−β g(A_i)}. The right-hand side becomes Σ_i g(A_i) 2^{−β g(A_i)}. Thus,

    0 = Σ_i [g(A_i) − G(β)] 2^{−β g(A_i)}    (9.21)

If this equation is multiplied by 2^{β G(β)}, the result is

    0 = f(β)    (9.22)

where the function f(β) is

    f(β) = Σ_i [g(A_i) − G(β)] 2^{−β [g(A_i) − G(β)]}    (9.23)

Equation 9.22 is the fundamental equation that is to be solved for particular values of G(β), for example G̃. The function f(β) depends on the model of the problem (i.e., the various g(A_i)) and on G̃, and that is all. It does not depend explicitly on α or the probabilities p(A_i).

How do we know that there is any value of β for which f(β) = 0? First, notice that since G̃ lies between the smallest and the largest g(A_i), there is at least one i for which g(A_i) − G̃ is positive and at least one for which it is negative. It is not difficult to show that f(β) is a monotonic function of β, in the sense that if β₂ > β₁ then f(β₂) < f(β₁). For large positive values of β, the dominant term in the sum is the one that has the smallest value of g(A_i), and hence f is negative. Similarly, for large negative values of β, f is positive. It must therefore be zero for one and only one value of β (this reasoning relies on the fact that f(β) is a continuous function).
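Because f(β) is continuous and decreases monotonically, the zero can be located by bisection. Here is a Python sketch; it assumes a bracket wide enough for f to change sign inside it, which holds whenever G̃ lies strictly between the smallest and largest g(A_i):

    # Solve f(beta) = 0 (Equations 9.22 and 9.23) by bisection.
    def f(beta, g, G):
        return sum((gi - G) * 2 ** (-beta * (gi - G)) for gi in g)

    def solve_beta(g, G, lo=-50.0, hi=50.0, iters=100):
        # f(lo) > 0 > f(hi) for a sufficiently wide bracket.
        for _ in range(iters):
            mid = 0.5 * (lo + hi)
            if f(mid, g, G) > 0:
                lo = mid    # the zero lies to the right of mid
            else:
                hi = mid    # the zero lies at or to the left of mid
        return 0.5 * (lo + hi)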

9.6.5 Examples

For the Berger's Burgers example, suppose that you are told the average meal price is $2.50, and you want to estimate the probabilities p(B), p(C), p(F), and p(T). Here is what you know:

    1 = p(B) + p(C) + p(F) + p(T)    (9.24)

    0 = $1.00 p(B) + $2.00 p(C) + $3.00 p(F) + $8.00 p(T) − $2.50    (9.25)

    S = p(B) log₂(1/p(B)) + p(C) log₂(1/p(C)) + p(F) log₂(1/p(F)) + p(T) log₂(1/p(T))    (9.26)

The entropy is the largest, subject to the constraints, if

    p(B) = 2^{−α} 2^{−β·$1.00}    (9.27)

    p(C) = 2^{−α} 2^{−β·$2.00}    (9.28)

    p(F) = 2^{−α} 2^{−β·$3.00}    (9.29)

    p(T) = 2^{−α} 2^{−β·$8.00}    (9.30)

where

    α = log₂(2^{−$1.00β} + 2^{−$2.00β} + 2^{−$3.00β} + 2^{−$8.00β})    (9.31)

and β is the value for which f(β) = 0, where

    f(β) = $0.50 × 2^{−$0.50β} + $5.50 × 2^{−$5.50β} − $1.50 × 2^{$1.50β} − $0.50 × 2^{$0.50β}    (9.32)

A little trial and error (or use of a zero-finding program) gives β = 0.2586 bits/dollar, α = 1.2371 bits, p(B) = 0.3546, p(C) = 0.2964, p(F) = 0.2478, p(T) = 0.1011, and S = 1.8835 bits. The entropy is smaller than the 2 bits which would be required to encode a single order of one of the four possible meals using a fixed-length code. This is because knowledge of the average price reduces our uncertainty somewhat. If more information is known about the orders, then a probability distribution that incorporates that information would have even lower entropy.
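These numbers are easy to reproduce. Here is a self-contained Python check that applies the bisection idea from the previous section to this example:

    # Reproduce the Berger's Burgers numbers from scratch.
    import math

    g = [1.00, 2.00, 3.00, 8.00]    # meal prices, dollars
    G = 2.50                        # known average price

    def f(b):                       # Equation 9.32
        return sum((gi - G) * 2 ** (-b * (gi - G)) for gi in g)

    lo, hi = -50.0, 50.0            # f(lo) > 0 > f(hi)
    for _ in range(100):            # bisection, as in Section 9.6.4
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) > 0 else (lo, mid)
    beta = 0.5 * (lo + hi)

    alpha = math.log2(sum(2 ** (-beta * gi) for gi in g))    # Eq. 9.31
    p = [2 ** (-alpha - beta * gi) for gi in g]              # Eqs. 9.27-9.30
    S = alpha + beta * G                                     # Eq. 9.15
    print(f"beta = {beta:.4f} bits/dollar, alpha = {alpha:.4f} bits, "
          f"S = {S:.4f} bits")
    print("p =", [round(pi, 4) for pi in p])
    # Expect values near beta = 0.2586, alpha = 1.2371, S = 1.8835, and
    # p = [0.3546, 0.2964, 0.2478, 0.1011], as quoted above.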

For the magnetic dipole example, we carry the derivation out with the magnetic field H set at some unspecified value. The results all depend on H as well as Ẽ.

    1 = p(U) + p(D)    (9.33)

    Ẽ = e(U) p(U) + e(D) p(D) = m_d H [p(D) − p(U)]    (9.34)

    S = p(U) log₂(1/p(U)) + p(D) log₂(1/p(D))    (9.35)

The entropy is the largest, for the given energy Ẽ and magnetic field H, if

    p(U) = 2^{−α} 2^{β m_d H}    (9.36)

    p(D) = 2^{−α} 2^{−β m_d H}    (9.37)

where

    α = log₂(2^{β m_d H} + 2^{−β m_d H})    (9.38)

and β is the value for which f(β) = 0, where

    f(β) = (m_d H − Ẽ) 2^{−β(m_d H − Ẽ)} − (m_d H + Ẽ) 2^{β(m_d H + Ẽ)}    (9.39)

Note that this example, with only one dipole and therefore only two states, does not actually require the Principle of Maximum Entropy, because there are two equations in two unknowns, p(U) and p(D) (you can solve Equation 9.39 for β using algebra). If there were two dipoles, there would be four states, and algebra would not have been sufficient. If there were many more than four possible states, this procedure to calculate β would have been impractical or at least very difficult. We therefore ask, in Chapter 11 of these notes, what we can tell about the various quantities even if we cannot actually calculate numerical values for them using the summation over states.
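Here is a sketch of the algebraic solution the text alludes to: normalization and Equation 9.34 determine p(U) and p(D) directly, and β then follows from the logarithm of the ratio of Equations 9.36 and 9.37. The field and energy values below are illustrative, not from the text.

    # Algebraic solution for the two-state dipole (illustrative numbers).
    import math

    m_d = 1.854e-23          # dipole constant, J/T (illustrative)
    H   = 1.0                # applied field (illustrative)
    E   = -0.5 * m_d * H     # assumed known expected energy, joules

    # Equations 9.33 and 9.34 are two linear equations in p(U), p(D):
    p_D = 0.5 * (1 + E / (m_d * H))
    p_U = 0.5 * (1 - E / (m_d * H))

    # Invert Equations 9.36/9.37: p(U)/p(D) = 2**(2*beta*m_d*H).
    beta = math.log2(p_U / p_D) / (2 * m_d * H)
    print(f"p(U) = {p_U}, p(D) = {p_D}, beta = {beta:.3e} per joule")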

MIT OpenCourseWare
http://ocw.mit.edu

6.050J / 2.110J Information and Entropy
Spring 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.